zero-shot-scfoundation's Introduction

Foundation models in single-cell biology: evaluating zero-shot capabilities

This repository contains the code that accompanies our paper, Assessing the limits of zero-shot foundation models in single-cell biology. You can find the preprint of the paper here.

Project overview

In this project, we assess two proposed foundation models in the context of single-cell RNA-seq: Geneformer (pub, code) and scGPT (pub, code). We focus on evaluating the zero-shot capabilities of these models, specifically their ability to generalize beyond their original training objectives. Our evaluation targets two main tasks: cell type clustering and batch integration. In these tasks, we compare the performance of Geneformer and scGPT against two baselines: scVI (pub, code) and a heuristic method that selects highly variable genes (HVGs). We also investigate the performance of the models in reconstructing the gene expression profiles of cells and compare it against baselines such as a mean expression value or average ranking.

Dependencies

Currently, the code requires a GPU supported by flash attention, which scGPT needs in order to run.

GPUs supported by flash attention are:

  • Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100).
  • Turing GPUs (e.g., T4, RTX 2080).
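
One way to check whether a particular GPU falls into these families is to compare its CUDA compute capability against 7.5, the value NVIDIA assigns to Turing (Ampere is 8.0 or 8.6, Ada is 8.9, Hopper is 9.0). The helper below is only a minimal sketch: the function name `supports_flash_attn` is hypothetical and not part of this repository, and the 7.5 threshold is our reading of the list above, not an official flash attention check.

```shell
# Hypothetical helper (not part of this repository).
# Assumption: the GPU families listed above correspond to CUDA compute
# capability 7.5 (Turing) or higher.
supports_flash_attn() {
    # $1 is a compute capability string such as "8.6"
    local major="${1%%.*}"
    local minor="${1#*.}"
    # If no dot was given (e.g. "8"), treat the minor version as 0
    if [ "$minor" = "$1" ]; then
        minor=0
    fi
    if [ "$major" -gt 7 ]; then
        return 0
    elif [ "$major" -eq 7 ] && [ "$minor" -ge 5 ]; then
        return 0
    fi
    return 1
}
```

For example, `supports_flash_attn 8.0 && echo "GPU looks supported"` prints the message for an A100. If PyTorch is already installed, the capability can be read with `python -c "import torch; print(torch.cuda.get_device_capability())"`.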

Installation

The installation time depends on (1) whether you choose mamba over conda (the former is much faster in our experience), (2) how many dependencies are already present in your environment, (3) the speed of your internet connection, and (4) the speed of your machine. The following steps took about 1 hour to complete on a remote HPC with a fast internet connection.

Conda / Mamba

You can install the dependencies using conda. To do so, you need to have conda installed on your machine. If you don't have it, you can install it from here.

We strongly recommend using mamba instead of conda, since it is much faster in our experience. If you are starting from scratch, i.e. don't have conda installed, you can install mamba instead of conda by following their guide here.

If you already have conda installed and want to benefit from the speed and enhanced experience of mamba, you can do so by running:

# install mamba in your base environment
conda install -c conda-forge mamba

Be warned, though: the creators of mamba do not recommend installing it this way.

Note: If you installed mamba from scratch, you can replace conda with mamba in all commands below. However, if you just installed mamba into your existing conda installation, use mamba only for creating the environment.

1. Installing conda environment

# install conda environment from conda_env.yml file
# in this step, you can use mamba instead of conda for speed
conda env create -f envs/conda_env.yml
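
If the same setup script has to run on machines with and without mamba, the choice of tool for this step can be made automatically. The following is a minimal sketch; the function name `env_create_tool` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): print "mamba" when it is
# available on this machine, otherwise fall back to "conda".
env_create_tool() {
    if command -v mamba >/dev/null 2>&1; then
        echo mamba
    else
        echo conda
    fi
}
```

It could then be used as `"$(env_create_tool)" env create -f envs/conda_env.yml`, which runs the same environment creation command as above with whichever tool was found.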

To activate the environment, run:

# activate conda environment
conda activate sc_foundation_evals

2. Installing scGPT

This can be tricky, as scGPT requires a specific flash-attn version, and flash attention can be difficult to install. If you run into any issues with the installation, check out the instructions from the flash-attn authors here, but bear in mind that they significantly updated their code with the 2.0 release, so the instructions might not entirely work for this version.

# make sure sc_foundation_evals env is activated
# We have found it easier to install flash attention first, and then scGPT
pip install flash-attn==1.0.4 --no-build-isolation
# then install v0.1.6 version of scGPT
pip install git+https://github.com/bowang-lab/scGPT.git@v0.1.6
pip install wandb

3. Installing Geneformer

pip install git+https://huggingface.co/ctheodoris/Geneformer.git

4. Installing sc_foundation_evals package

And finally, install the sc_foundation_evals package (the code to run evaluations on zero-shot scFoundation models) itself.

cd sc_foundation_evals
pip install .

To run the notebooks, you also need to download the model weights. scGPT weights are available here and Geneformer weights are available in its repository. As per the instructions in the Geneformer repository, make sure you have git lfs installed before downloading the weights by cloning the repository.

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/ctheodoris/Geneformer
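
If git lfs was not set up before cloning, the weight files in the clone are small text pointers, not the actual weights. A Git LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`, so this can be detected with a quick check. The helper below is a minimal sketch; the name `is_lfs_pointer` is hypothetical and belongs neither to this repository nor to Geneformer.

```shell
# Hypothetical helper (not part of this repository or of Geneformer).
# Succeeds when the file at $1 is a Git LFS pointer, i.e. a small text stub
# whose first line starts with the LFS spec URL, and fails otherwise.
is_lfs_pointer() {
    # Only the first bytes are inspected, so large weight files are not read in full
    head -c 64 "$1" 2>/dev/null \
        | grep -q '^version https://git-lfs\.github\.com/spec/v1'
}
```

For example, `is_lfs_pointer path/to/weights_file && echo "pointer only, run: git lfs pull"` (with `path/to/weights_file` standing in for a file from the clone) flags a file that still needs to be fetched.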

Docker

Support for docker is coming soon.

Running the code

Copying this repository

To run the code, you need to clone this repository.

git clone https://github.com/microsoft/zero-shot-scfoundation

Then download and unpack the data, which are stored on figshare (see here for more details).

cd zero-shot-scfoundation
# download and unpack the data
wget https://figshare.com/ndownloader/files/43480497 -O data.zip
unzip data.zip && rm data.zip

Notebooks

To best understand the code and its organization, please have a look at the notebooks in the notebooks directory.

Any questions?

If you have any questions, or find any issues with the code, please open an issue in this repository. You can find more information on how to file an issue here. We also welcome any contributions to the code; be sure to check out the Contributing section below.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

zero-shot-scfoundation's People

Contributors

kzkedzierska, lorinanthony, microsoft-github-operations[bot], microsoftopensource

zero-shot-scfoundation's Issues

Release of Used Dataset

Hi,
Thank you for the great work! We wonder if you have plans to release all the data you have used in the preprint?

Question about Figure 6c

[Screenshot: Figure 6c from the paper]

Quick question about Figure 6c (above). From the paper, it seems that the highly expressed genes are of rank 1.0 and the lowly expressed genes of rank 0.0. This would be in line with the results of scGPT: the model does well at predicting highly expressed genes and poorly at predicting lowly expressed ones.
Now, as far as I understand, in Geneformer, the highly expressed genes are at the front of the list (source) and in your code you eventually divide the rank by the maximum rank (source), so wouldn't that mean that Geneformer does better at predicting lowly expressed genes and worse at predicting highly expressed ones?

Here are also two plots that I created independently that would confirm my assessment:

  1. Here I'm taking the cross entropy with reduction='none' and then plotting the mean cross entropy across a batch of cells. It suggests that the model is less confident for higher ranked genes.
    [image: mean cross entropy plot]

  2. Here I'm plotting the sliding window accuracy and F1 score (50 genes at a time) from the top of the list to the bottom. It also suggests that the model does better for the highly expressed genes.
    [image: sliding window accuracy and F1 score plot]

Can you confirm that you didn't alter the way Geneformer ranks genes in your assessment?

Assistance Request for Accessing Complete Datasets

Hi kzkedzierska,

I've been exploring your paper and I'm really impressed by your work. Great stuff!

I found the pancreas dataset link in the paper (thanks for that!), but I'm hitting a bit of a wall trying to track down the other four datasets you mentioned. I've followed the references as best as I can, yet it seems like I'm missing something or maybe they're not available at the referenced locations.

Could you point me in the right direction to find those datasets? Any guidance or direct links would be super helpful.

Thanks a bunch for your time, and for the great contributions you're making to the field!

Cheers,
renly0313
