
Learning and actioning general principles of cancer cell drug sensitivity

*(Workflow overview figure)*

This repository contains the complete codebase required to reproduce the analysis presented in the paper Learning and actioning general principles of cancer cell drug sensitivity.

Repository Structure

The code is organized into four main directories, each tailored to facilitate specific aspects of the analysis:

  • CellHit: a custom library that encapsulates all the functions used throughout the analysis, designed to be reusable in similar research.

  • scripts: Python scripts covering tasks from data pre-processing to model training. A Markdown file in this folder provides detailed descriptions and usage instructions for each script.

  • AsyncDistribJobs: an auxiliary custom library for efficiently managing asynchronous parallel jobs in HPC environments.

  • data: the data needed to reproduce all reported results.

1. Development Environment Setup

This guide provides detailed steps to set up the development environment necessary to replicate the results from our research. The setup includes creating a new Python environment, installing general libraries, large language model (LLM) libraries, and compiling XGBoost with GPU support.

**NOTE:** the overall time required to set up the environment may vary greatly from system to system.

Creating a New Python Environment

First, create a new environment using Conda and install the CUDA toolkit:

conda create -n CellHit python=3.11
conda install -c "nvidia/label/cuda-11.8" cuda-toolkit
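
As an optional sanity check (a minimal sketch; it assumes the environment name chosen above), confirm that the CUDA 11.8 compiler is visible before installing any GPU packages:

conda activate CellHit
nvcc --version   # should report "release 11.8"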

General Libraries Installation

Install the following general-purpose libraries using pip:

# the +cu118 wheels are hosted on the PyTorch package index, hence the extra index URL
pip install biopython==1.82 \
            SQLAlchemy==2.0.23 \
            tqdm==4.66.1 \
            torch==2.1.2+cu118 \
            torchaudio==2.1.2+cu118 \
            torchvision==0.16.2+cu118 \
            numba==0.58.1 \
            openpyxl==3.1.2 \
            --extra-index-url https://download.pytorch.org/whl/cu118
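
As a quick check (a minimal sketch) that PyTorch resolved to the CUDA build:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# expected output: 2.1.2+cu118 True (on a machine with a visible NVIDIA GPU)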

LLM Libraries Installation

Install libraries specifically used for working with large language models:

# note: transformers >= 4.31.0 is required by auto-gptq 0.6.0
# the auto-gptq +cu118 wheels are hosted on the AutoGPTQ package index
pip install guidance==0.1.8 \
            openai==0.28.1 \
            requests==2.31.0 \
            transformers==4.31.0 \
            auto-gptq==0.6.0+cu118 \
            optimum==1.16.1 \
            peft==0.7.1 \
            --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
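
Optionally, verify that the pinned versions resolved correctly:

pip show transformers auto-gptq | grep -E "^(Name|Version)"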

Installing vLLM with CUDA 11.8 requires a specific procedure:

# Install vLLM with CUDA 11.8
export VLLM_VERSION=0.2.7
export PYTHON_VERSION=311
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl

# Re-install PyTorch with CUDA 11.8
pip uninstall torch -y
pip install torch --upgrade --index-url https://download.pytorch.org/whl/cu118

# Re-install xFormers with CUDA 11.8
pip uninstall xformers -y
pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118
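
After the re-installs, a quick check (a minimal sketch) that the CUDA versions line up:

python -c "import vllm, torch; print(vllm.__version__, torch.version.cuda)"
# expected output: 0.2.7 11.8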

For additional information, refer to the original vLLM documentation.

Machine Learning Libraries

Install additional machine learning libraries:

pip install scikit-learn==1.3.2 shap==0.43.0 optuna==3.3.0 
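
An optional import check for the machine learning stack:

python -c "import sklearn, shap, optuna; print(sklearn.__version__, shap.__version__, optuna.__version__)"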

Compiling XGBoost with GPU Support

Follow these steps to compile XGBoost with GPU support, using CUDA 11.8 and gcc 10.2.0:

# Obtain the code
git clone --recursive https://github.com/dmlc/xgboost

# Compiling XGBoost
cd xgboost
mkdir build
cd build
cmake .. -DUSE_CUDA=ON
make -j$(nproc)

# Then install the Python package
cd ../python-package
pip install .
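
To confirm the GPU build was picked up (a minimal sketch; build_info() is available in recent XGBoost versions):

python -c "import xgboost; print(xgboost.build_info().get('USE_CUDA'))"
# expected output: True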

For additional information, refer to the XGBoost documentation.

Make sure to follow these instructions sequentially to avoid any issues with dependencies and library versions. Once you have finished reproducing the environment, clone the repository with:

git clone https://github.com/raimondilab/CellHit.git

2. Getting the data

Most of the data required to replicate the results is in the data folder. However, some files were too large for direct upload to GitHub, in particular:

  • Transcriptomics data for CCLE and TCGA. To access this data, create a transcriptomics folder within the data folder, alongside metadata and reactome, and download the files into it. If you wish to bypass the Celligner steps, a pre-aligned version of these transcriptomics datasets is available here.

  • MOA data: download prism_LLM_drugID_to_genes.json from here and place it in the MOA_data folder inside the data folder.
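
A minimal sketch of the resulting layout, assuming the lowercase folder names described above (the source path of the downloaded JSON file is a placeholder):

cd CellHit
mkdir -p data/transcriptomics data/MOA_data
# the downloaded transcriptomics files go under data/transcriptomics/
mv ~/Downloads/prism_LLM_drugID_to_genes.json data/MOA_data/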

3. Getting LLM weights

For the analysis we used a GPTQ-quantized version of Mixtral-8x7B-Instruct-v0.1 from Mistral AI. The scripts search for this model in your home folder. To obtain the weights, first install huggingface-cli and log in with the following commands:

# install
pip install -U "huggingface_hub[cli]"

#login
huggingface-cli login

To log in you need a User Access Token from your Settings page (more on this in the Hugging Face official documentation). Once logged in, you can obtain the weights of the LLM used with the following command:

huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir <your_home_folder>
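
To confirm the download completed, you can list the checkpoint files (exact file names depend on the revision; <your_home_folder> is the directory chosen above):

ls -lh <your_home_folder>/config.json <your_home_folder>/*.safetensors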

Hardware requirements

Our pipeline has primarily been developed and executed in High Performance Computing (HPC) environments. We have used the following types of GPU-accelerated nodes:

  • Daneel nodes on HPC@SNS, each equipped with 2 Intel Xeon CPUs (36 cores in total, 18 per socket), 1.5 TB of RAM (about 42 GB/core), 6 TB of local scratch space, and 4 NVIDIA Tesla V100 GPUs with 32 GB of RAM each;
  • Gaia nodes on HPC@SNS, each equipped with 2 AMD EPYC 7352 CPUs (48 cores in total, 24 physical cores per socket), 512 GB of RAM (10.6 GB/core), 4 NVIDIA A100 GPUs, and a local scratch area of ~890 GB;
  • BullSequana X2135 "Da Vinci" nodes on the Leonardo supercomputer @ CINECA, each equipped with 1 Intel Xeon 8358 CPU (32 cores, 2.6 GHz), 512 GB (8 x 64 GB) of DDR4-3200 RAM, and 4 NVIDIA custom Ampere (A100) GPUs with 64 GB of HBM2 each.

Despite the high-performance hardware required for the development and training phases, the final trained models are designed to be deployable on standard desktops or laptops without GPU acceleration. These models will be made available upon publication.

Troubleshooting and Support

If you encounter any issues while setting up or using this environment, please do not hesitate to reach out for help or clarification:

  • Open an Issue: For problems or enhancements related to the code, please open an issue directly on the GitHub repository.

  • Contact via Email: If you have specific questions or need further assistance, you can email us at [email protected].

We are committed to providing support and making continuous improvements to this project.
