
cellbox's Introduction

Introduction

This repository contains the code for https://sanderlab.org

How to Edit the sanderlab.org Website

Edit Text

  1. Edit this data file: https://github.com/sanderlab/sanderlab/edit/master/docs/sanderlabdata.json
  2. Make sure the result is valid JSON: https://jsonformatter.curiousconcept.com/#
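If you prefer a quick local check instead of the online validator, the standard library is enough (a minimal sketch; it assumes you have the repository checked out so docs/sanderlabdata.json exists locally):

import json

# Raises json.JSONDecodeError with a line/column number if the edit broke the file.
with open("docs/sanderlabdata.json") as f:
    json.load(f)
print("Valid JSON.")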

Edit Images

Images may be added or removed here: https://github.com/sanderlab/sanderlab/tree/master/docs/images NOTE: Ensure images of people are placed in the people folder, separate from images for research activities.

Deployment

Wait 5-10 minutes for the website to be deployed automatically to sanderlab.org with the new changes via the GitHub Pages system; if it does not update, contact the site administrators. NOTE: Only changes in the docs/ folder will trigger re-deployment.
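As an optional sanity check, you can poll the deployed site until your change appears (a sketch; the URL assumes GitHub Pages serves the docs/ folder at the site root, and NEW_TEXT is a placeholder for a string you just added):

import time
import requests

NEW_TEXT = "NEW_TEXT"  # placeholder: a string you just added to sanderlabdata.json
for _ in range(20):  # roughly 10 minutes at 30-second intervals
    page = requests.get("https://sanderlab.org/sanderlabdata.json")
    if NEW_TEXT in page.text:
        print("Deployed.")
        break
    time.sleep(30)
else:
    print("Not deployed yet; contact the site administrators.")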

Projects

Linode-Hosted Project List

This is a list of projects available on the Linode server. They are accessible via the following:

Java-Based

Shiny

Javascript (ALL DEPRECATED)

  • rcytoscapejs: Unpublished
  • alignmentviewer: Version unpublished

cellbox's People

Contributors

cannin, danieldritter, debanitrkl, desmondyuan, judyueshen, mustardburger


cellbox's Issues

expr_index

Hi,
A naive question here. What do columns 2 and 3 in expr_index.txt indicate?
Thanks,
Xiao

Questions about train.py

Issue type

Need help

Summary

Some functions in /cellbox/train.py are ambiguous in what task they perform. Understanding them is crucial for reproducing similar results in the PyTorch version of CellBox, so this issue is for resolving the ambiguity.

Details

  • Lines 76 to 79 in train.py: are loss_valid_i and loss_valid_mse_i evaluated on one random batch fetched from args.feed_dicts['valid_set'], or on the whole validation set? (See the sketch after this list for the distinction.)
  • The eval_model function returns different values with different calls. At lines 101 to 103, it returns both the total and MSE loss for args.n_batches_eval batches on the validation set. At lines 109 to 111, it returns only the MSE loss for args.n_batches_eval batches on the test set. And at line 262 it returns the expression predictions y_hat for the whole test set. Are all of these statements correct?
  • The record_eval.csv file generated after training, using the default training arguments and config file as specified in the README (python scripts/main.py -config=configs/Example.random_partition.json), has the test_mse column set to None. Is this the expected behaviour of the code?
  • random_pos.csv, generated after training, stores the indices of the perturbation conditions. Does it indicate how the conditions for training, validation, and testing are split?
  • After each substage, say substage 6, the code generates 6_best.y_hat.loss.csv, containing the expression predictions for the perturbation conditions in the test set for all nodes, but it does not indicate which row in this file corresponds to which perturbation condition. How are this file and random_pos.csv related?
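For concreteness, here is what the two readings of the first bullet would look like in a PyTorch port (a sketch to pin down the question, not CellBox code):

import torch

def loss_one_batch(model, loader, loss_fn):
    # Reading 1: evaluate on a single random batch from the validation loader.
    x, y = next(iter(loader))
    return loss_fn(model(x), y).item()

@torch.no_grad()
def loss_full_set(model, loader, loss_fn):
    # Reading 2: average the loss over every batch in the validation set.
    total, n = 0.0, 0
    for x, y in loader:
        total += loss_fn(model(x), y).item() * len(x)
        n += len(x)
    return total / n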

[ URGENT GSOC INQUIRY ] - CellBox installation

Hey there! Just a small issue I have been facing; I have tried several iterations to diagnose it.

This is specific to the macOS environment.

[Screenshot: build error output, 2021-04-09]

The error message specifically mentions disabling the use of optimized BLAS and LAPACK by setting their variables to the null string.

I am genuinely curious where I can perform this operation so that I can build this package.
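If the message is NumPy's standard source-build error, the variables it refers to are likely NPY_BLAS_ORDER and NPY_LAPACK_ORDER, which can be set to the empty string before installing (a sketch under that assumption; the exact variable names depend on your NumPy version):

import os
import subprocess
import sys

# Assumption: the build message refers to NumPy's BLAS/LAPACK search order.
# Setting these to the empty string makes the build fall back to NumPy's
# bundled, unoptimized routines.
os.environ["NPY_BLAS_ORDER"] = ""
os.environ["NPY_LAPACK_ORDER"] = ""
subprocess.check_call([sys.executable, "-m", "pip", "install", "."])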

Usage discussion

[Screenshot: command output on Binder, 2021-02-18]

I tried to run the command on Binder and got the output shown in the screenshot above.

Apologies if this is a naive question, but I do not understand how we are able to train a model without giving it any precise outputs.

All input is welcome.

Thank you :)

@cannin @DesmondYuan @judyueshen

Numerical instability between TensorFlow and PyTorch

Issue type

Bug or help needed

Relevant package versions

numpy == 1.24.1
tensorflow == 2.11.0
torch == 2.0.1

Python version

3.8.0

Current behaviour

The envelope forms in TensorFlow and PyTorch (defined here) yield very similar results (the difference between the two outputs is on the order of 1e-8). However, these differences accumulate over the time steps of the ODE solver and become very noticeable after around 150 to 200 steps.

Code to reproduce

The recommended envelope form for CellBox is tanh. The code below compares the outputs of TensorFlow's and PyTorch's isolated envelope forms, set to tanh (defined in KernelConfig). No ODE solving is involved yet.

import numpy as np
import tensorflow.compat.v1 as tf
import torch
import torch.nn as nn  # needed by the 'hill' envelope below
tf.disable_v2_behavior()

class KernelConfig(object):
    def __init__(self):
        
        self.n_x = 5
        self.envelope_form = "tanh" # options: tanh, polynomial, hill, linear, clip linear
        self.envelope_fn = None
        self.polynomial_k = 2 # larger than 1
        self.ode_degree = 1
        self.envelope = 0
        self.ode_solver = "heun" # options: euler, heun, rk4, midpoint
        self.dT = 0.1
        self.n_T = 1000
        self.gradient_zero_from = None

args = KernelConfig()
W = np.random.normal(loc=0.01, size=(args.n_x, args.n_x))
eps = np.ones((args.n_x, 1), dtype=np.float32)
alpha = np.ones((args.n_x, 1), dtype=np.float32)
y0_np = np.zeros((args.n_x, 1))

# Test the envelope
def tensorflow_envelope():
    from cellbox.kernel import get_envelope
    envelope_fn = get_envelope(args)

    params = {}
    W_copy = np.copy(W)
    params["W"] = tf.convert_to_tensor(W_copy, dtype=tf.float32)
    if args.ode_degree == 1:
        def weighted_sum(x):
            return tf.matmul(params['W'], x)
    
    return envelope_fn(weighted_sum(tf.convert_to_tensor(params["W"], dtype=tf.float32))).eval(session=tf.compat.v1.Session())

def pytorch_get_envelope(args):
    """get the envelope form based on the given argument"""
    if args.envelope_form == 'tanh':
        args.envelope_fn = torch.tanh
    elif args.envelope_form == 'polynomial':
        k = args.polynomial_k
        assert k > 1, "Hill coefficient has to be k>=2."
        if k % 2 == 1:  # odd order polynomial equation
            args.envelope_fn = lambda x: x ** k / (1 + torch.abs(x) ** k)
        else:  # even order polynomial equation
            args.envelope_fn = lambda x: x**k/(1+x**k)*torch.sign(x)
    elif args.envelope_form == 'hill':
        k = args.polynomial_k
        assert k > 1, "Hill coefficient has to be k>=2."
        args.envelope_fn = lambda x: 2*(1-1/(1+nn.functional.relu(torch.tensor(x+1)).numpy()**k))-1
    elif args.envelope_form == 'linear':
        args.envelope_fn = lambda x: x
    elif args.envelope_form == 'clip linear':
        args.envelope_fn = lambda x: torch.clamp(x, min=-1, max=1)
    else:
        raise Exception("Illegal envelope function. Choose from [tanh, polynomial/hill]")
    return args.envelope_fn

def pytorch_envelope():
    envelope_fn = pytorch_get_envelope(args)
    params = {}
    W_copy = np.copy(W)
    params["W"] = torch.tensor(W_copy, dtype=torch.float32)
    if args.ode_degree == 1:
        def weighted_sum(x):
            return torch.matmul(params['W'], x)

    return envelope_fn(weighted_sum(torch.tensor(params["W"], dtype=torch.float32))).numpy()

tf_out = tensorflow_envelope()
torch_out = pytorch_envelope()
print(np.abs(tf_out - torch_out))

The output is:

[[0.0000000e+00 1.4901161e-08 0.0000000e+00 0.0000000e+00 0.0000000e+00]
 [5.9604645e-08 0.0000000e+00 5.9604645e-08 2.9802322e-08 5.9604645e-08]
 [1.1920929e-07 0.0000000e+00 9.3132257e-10 0.0000000e+00 2.9802322e-08]
 [2.9802322e-08 1.4901161e-08 5.9604645e-08 1.8626451e-09 5.9604645e-08]
 [5.9604645e-08 5.9604645e-08 5.9604645e-08 0.0000000e+00 0.0000000e+00]]

If using polynomial with args.polynomial_k = 2:

args.envelope_form = "polynomial"
args.polynomial_k = 2
tf_out = tensorflow_envelope()
torch_out = pytorch_envelope()
print(np.abs(tf_out - torch_out))

The output is:

[[0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 0.0000000e+00 5.9604645e-08 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 0.0000000e+00 1.4551915e-11 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00]]

However, if changing the envelope form to clip linear:

args.envelope_form = "clip linear"
tf_out = tensorflow_envelope()
torch_out = pytorch_envelope()
print(np.abs(tf_out - torch_out))

The output is:

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

This difference might be small, but it adds up within the ODE solver and causes the final results of the TensorFlow and PyTorch ODE solvers to differ significantly. The same issue persists when args.envelope_form is set to hill or polynomial. However, when args.envelope_form is set to linear or clip linear, the difference between the TensorFlow and PyTorch ODE solvers is exactly 0, which leads me to believe that the numerical discrepancy in the other envelope functions causes this behaviour.
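To see why such a small discrepancy can become significant, consider the amplification along an expanding direction of the dynamics (a toy sketch, not CellBox code; whether the learned system actually has expanding directions depends on W):

import numpy as np

dT, n_T = 0.1, 200  # same step size as above; a step count in the reported range

def heun(f, x, dT, n_T):
    # Heun's method, one of the solvers listed in KernelConfig above.
    for _ in range(n_T):
        k1 = f(x)
        x = x + dT * (k1 + f(x + dT * k1)) / 2
    return x

f = lambda x: x                    # simplest expanding vector field, dx/dt = x
xa = heun(f, 1.0, dT, n_T)
xb = heun(f, 1.0 + 1e-8, dT, n_T)  # perturbed at the scale reported above
print(xb - xa)                     # ~1e-8 * exp(20), i.e. about 5: no longer tiny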

Solution

Is there a way around this? If two ODE solutions are very different, which one is the correct solution?

Typo in Readme.md

jovyan@jupyter-dfci-2dcellbox-2d7985pt7y:~$ python scripts/main.py -config=configs/Example.random_partition.json
WARNING: Logging before flag parsing goes to stderr.
W0206 06:36:49.652899 139714245154624 __init__.py:329] Limited tf.compat.v2.summary API due to missing TensorBoard installation.

        version 0.0.2
        -- Jan 20, 2019 --
        * Huge Bug fixed with gradually increasing nT
        * Reorganize utils.py

        version 0.0.3
        -- Jan 21, 2019 --
        * Adding test of convergece

        version 0.0.3.1
        -- Jan 23, 2019 --
        * Roll back to x_0 = 1

        version 0.0.3
        -- Jan 21, 2019 --
        * Roll back 0.0.3

        version 0.0.3.2
        -- Jan 26, 2019 --
        * Adding outputs for test_convergence()

        version 0.0.3.3
        -- Jan 30, 2019 --
        * use last 20 time step for test_convergence()

        version 0.0.3.4
        -- Jan 31, 2019 --
        * use 0.1 as initial values for alpha and eps variable

        version 0.0.4
        -- Feb 11, 2019 --
        * Roll back 0.0.3.3
        * Add constraints on direct regulation from drug nodes to phenotypic nodes

        version 0.0.5
        -- Feb 21, 2019 --
        * Add function to normalize mse loss to different nodes.

        version 0.1.0
        -- Aug 21, 2019 --
        * Re-structure codes for publish.

        version 0.1.1
        -- Oct 4, 2019 --
        * Add new kinetics
        * Add new ODE solvers
        * Add new envelop forms

        version 0.2.0
        -- Feb 26, 2020 --
        * Add support of matrix operation rather than function mapping
        * Roughly 5x faster

        version 0.2.1
        -- Apr 5, 2020 --
        * Reformat for better code style
        * Revise docs

        version 0.2.2
        -- Apr 23, 2020 --
        * Add support to tf.Datasets
        * Add support to tf.sparse
        * Prepare for sparse single-cell data

        version 0.2.3
        -- June 8, 2020 --
        * Add support to L2 loss (alone or together with L1, i.e. elastic net)
        * Clean the example configs folder


Traceback (most recent call last):
  File "scripts/main.py", line 64, in <module>
    cfg = pertbio.config.Config(master_args.experiment_config_path)
  File "/srv/conda/envs/notebook/lib/python3.6/site-packages/pertbio/config.py", line 12, in __init__
    with open(config_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'configs/Example.random_partition.json'

Error in python scripts/main.py -config=configs/Example.random_partition.json

Suggested change: python scripts/main.py -config=configs/example.random_partition.json

installing required modules from requirements.txt

After creating a new environment and running cellbox/setup.py, I had to manually install the h5py module. It would be better to mention this in setup.py, or to install all modules listed in requirements.txt when running setup.py.
If you agree, I will send a pull request. (One possible shape for the change is sketched below.)
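A common way to do this (a sketch of the proposed change, assuming requirements.txt sits next to setup.py; the real setup.py carries more metadata):

from setuptools import setup, find_packages

# Read requirements.txt so `pip install .` pulls in h5py and the rest automatically.
with open("requirements.txt") as f:
    requirements = [line.strip() for line in f
                    if line.strip() and not line.startswith("#")]

setup(
    name="cellbox",
    packages=find_packages(),
    install_requires=requirements,
)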

Update CellBox README

This issue is for updating the README file for future users to better install and use CellBox.

Best practices for using CellBox on different datasets

For external users wanting to use CellBox on their own dataset, what is the best practice for training the model? How many models in total, differing by the seed (--working_index), should be trained before the collection of models achieves statistical power? This question follows the Network Interpretation part of the Methods section in the original CellBox paper, where 1000 models were trained for downstream analysis. CellBox and its ODE solver are susceptible to suboptimal weight initialization: choosing the wrong random seed (--working_index) while keeping all other configs and arguments the same can lead to very different results. Therefore, should new users with a new dataset train only one model, or multiple models with different random seeds, to get the best performance?
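In script form, the multi-model protocol from the paper would look roughly like this (a sketch assuming --working_index is what varies the seed, as described above; the ensemble size is illustrative):

import subprocess
import sys

CONFIG = "configs/Example.random_partition.json"
N_MODELS = 100  # the paper trained 1000 models; scale to your compute budget

for working_index in range(N_MODELS):
    # Each run differs only by its seed; downstream analysis aggregates the ensemble.
    subprocess.check_call([
        sys.executable, "scripts/main.py",
        f"-config={CONFIG}",
        "--working_index", str(working_index),
    ])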

Inconsistent drug indexing in loo_label.csv, expr_index.txt, and --drug_index argument

Can you provide more information about what each row and index in loo_label.csv and expr_index.txt represent? I believe they are the labels of the drug perturbations, because each row in loo_label.csv corresponds to a row in pert.csv and expr.csv, but I cannot tell what the numeric indices in loo_label.csv represent.

From the paper, there are 12 drugs being tested. The --drug_index argument therefore refers to the drug that is left out during training. I would assume that, for example, when I run python scripts/main.py -config=configs/Example.leave_one_out.json --drug_index 12, all the rows in pert.csv that belong to the drug at index 12 (indicated in loo_label.csv) are left out of the training set. However, on closer inspection, I see that testidx (defined in dataset.py) contains indices that point to rows in loo_label.csv that have the number 9. Similarly, setting --drug_index 11 points to rows with the number 8, and so on. But setting --drug_index from 0 to 7 points correctly to rows in loo_label.csv with that number.

Can you confirm whether this is expected behaviour? This is important for testing my PyTorch dataloader, to confirm it fetches the same rows in pert.csv as the current TensorFlow dataloader.
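The check above can be made mechanical (a sketch; the file layout is an assumption based on this issue, and testidx is the index list produced by dataset.py for one leave-one-out run):

import pandas as pd

loo = pd.read_csv("loo_label.csv", header=None)

def held_out_labels(testidx):
    """Return the set of loo_label.csv labels among the held-out rows."""
    return set(loo.iloc[testidx].values.ravel())

# Expected: held_out_labels(testidx) == {N} for --drug_index N.
# Observed above: {9} for N = 12 and {8} for N = 11, but {N} for N in 0..7.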

self.train_y0 not found when setting model to LinReg

When the model option in the JSON config file is set to LinReg, the program throws an error that self.train_y0 is not defined. This is because self.train_y0 is only defined when the model is set to CellBox, since the build function in CellBox instantiates self.train_y0; in the other models it is never instantiated. The same applies to self.monitor_y0 and self.eval_y0.
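One possible fix (a sketch; only the attribute names come from this issue, the surrounding class structure is assumed): give every model class safe defaults so the training loop does not crash on models that never build the y0 tensors.

class ModelBase:
    def __init__(self):
        # Defaults for models whose build() never sets these (e.g. LinReg).
        self.train_y0 = None    # set by CellBox.build()
        self.monitor_y0 = None
        self.eval_y0 = None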
