

scBERT

python >3.6.8

scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data

Reliable cell type annotation is a prerequisite for downstream analysis of single-cell RNA sequencing data. Existing annotation algorithms typically suffer from improper handling of batch effects, a lack of curated marker gene lists, or difficulty in leveraging latent gene-gene interaction information. Inspired by large-scale pretrained language models, we present scBERT (single-cell Bidirectional Encoder Representations from Transformers), a pretrained deep neural network-based model, to overcome these challenges. scBERT follows the state-of-the-art pre-train and fine-tune paradigm of the deep learning field. In the first phase, scBERT obtains a general understanding of gene-gene interactions by being pre-trained on huge amounts of unlabeled scRNA-seq data. The pre-trained scBERT can then be used for cell annotation of unseen and user-specific scRNA-seq data through supervised fine-tuning. For more information, please refer to https://www.biorxiv.org/content/10.1101/2021.12.05.471261v1

Install

scipy-1.5.4 torch-1.8.1 numpy-1.19.2 pandas-1.1.5 scanpy-1.7.2 scikit_learn-0.24.2 transformers-4.6.1

Data

The data can be downloaded from the links below. If you have any questions, please contact [email protected].

https://drive.weixin.qq.com/s?k=AJEAIQdfAAozQt5B8k
https://drive.google.com/file/d/1fNZbKx6LPeoS0hbVYJFI8jlDlNctZxlU/view?usp=sharing

Checkpoint

The pre-trained model checkpoint can be downloaded from the link below. If you have any questions, please contact [email protected].

https://drive.weixin.qq.com/s?k=AJEAIQdfAAoUxhXE7r

Usage

The test single-cell transcriptomics data file should be pre-processed in two steps. First, revise the gene symbols according to the NCBI Gene database (updated on Jan. 10, 2020); unmatched genes and duplicated genes are removed. Then normalize the data with the sc.pp.normalize_total and sc.pp.log1p methods in scanpy (Python package), as detailed in preprocess.py.
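To make the two pre-processing steps concrete, here is a minimal plain-Python sketch of the arithmetic for a single cell. It is an illustration only, not the contents of preprocess.py: the helper names, the toy gene list, the choice to keep the first occurrence of a duplicated symbol, and target_sum=1e4 are all assumptions made for this example. In practice you would call sc.pp.normalize_total and sc.pp.log1p on an AnnData object instead.

```python
import math


def filter_genes(symbols, reference):
    """Return indices of genes to keep.

    Assumption for this sketch: a gene is kept if its symbol is in the
    reference set, and only the first occurrence of a repeated symbol is kept.
    """
    seen = set()
    keep = []
    for i, symbol in enumerate(symbols):
        if symbol in reference and symbol not in seen:
            seen.add(symbol)
            keep.append(i)
    return keep


def normalize_total(counts, target_sum=1e4):
    """Scale one cell's counts so that they sum to target_sum (assumed value)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c * target_sum / total for c in counts]


def log1p_transform(values):
    """Apply the natural-log transform log(1 + x) to every value."""
    return [math.log1p(v) for v in values]


# Toy input: one cell, four measured genes (hypothetical data).
symbols = ["CD3E", "CD19", "CD3E", "NOT_A_GENE"]
reference = {"CD3E", "CD19", "MS4A1"}
raw_counts = [3, 1, 5, 7]

keep = filter_genes(symbols, reference)
kept_symbols = [symbols[i] for i in keep]
kept_counts = [raw_counts[i] for i in keep]
processed = log1p_transform(normalize_total(kept_counts))

print(kept_symbols)  # ['CD3E', 'CD19']
```

The duplicated CD3E entry and the unmatched symbol are dropped before normalization, so the scaling is computed only over genes that survive the symbol revision.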

You can download this repo and run the demo task on your own machine in about 4 hours.

  • Fine-tune using pre-trained models
python -m torch.distributed.launch finetune.py --data_path "fine-tune_data_path" --model_path "pretrained_model_path"
#The cell type information is stored in 'label' and 'label_dict' files.
  • Predict using fine-tuned models
python predict.py --data_path "test_data_path" --model_path "finetuned_model_path"
#The cell type information will be loaded from 'label' and 'label_dict' files.
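The on-disk format of the 'label' and 'label_dict' files is not described here. The sketch below shows one plausible layout, purely as an assumption: 'label_dict' holds the sorted unique cell type names, 'label' holds one integer index per cell, and both are pickled. The helper names and toy cell types are hypothetical; check finetune.py and predict.py for the actual format.

```python
import pickle
import tempfile
from pathlib import Path


def encode_labels(cell_types):
    """Return (label_dict, label): sorted unique names and one index per cell."""
    label_dict = sorted(set(cell_types))
    index_of = {name: i for i, name in enumerate(label_dict)}
    label = [index_of[name] for name in cell_types]
    return label_dict, label


def decode_labels(label_dict, label):
    """Map integer indices back to cell type names."""
    return [label_dict[i] for i in label]


# Hypothetical annotations for four cells.
cell_types = ["T cell", "B cell", "T cell", "Monocyte"]
label_dict, label = encode_labels(cell_types)

# Round-trip through files named like the ones mentioned above
# (pickle is an assumed serialization for this sketch).
with tempfile.TemporaryDirectory() as tmp:
    for name, obj in (("label_dict", label_dict), ("label", label)):
        with open(Path(tmp) / name, "wb") as fh:
            pickle.dump(obj, fh)
    with open(Path(tmp) / "label_dict", "rb") as fh:
        restored_dict = pickle.load(fh)
    with open(Path(tmp) / "label", "rb") as fh:
        restored_label = pickle.load(fh)

decoded = decode_labels(restored_dict, restored_label)
print(decoded)  # ['T cell', 'B cell', 'T cell', 'Monocyte']
```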
  • Detection of novel cell types

Novel cell types can be detected by thresholding the predicted probabilities (default threshold: 0.5).

python predict.py --data_path "test_data_path" --model_path "finetuned_model_path" --novel_type True --unassign_thres "custom_threshold"  
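The thresholding idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in predict.py: the function name, the "Unassigned" label, and the rule that a cell is flagged when its highest predicted probability is strictly below the threshold are all choices made for this example.

```python
def assign_cell_types(probabilities, label_dict, unassign_thres=0.5):
    """Pick the most probable cell type per cell, or flag the cell as novel.

    probabilities: one list of class probabilities per cell.
    label_dict: cell type names, indexed by class.
    unassign_thres: cells whose top probability is below this are flagged.
    """
    assigned = []
    for probs in probabilities:
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] < unassign_thres:
            assigned.append("Unassigned")  # assumed placeholder label
        else:
            assigned.append(label_dict[best])
    return assigned


# Hypothetical class names and predicted probabilities for three cells.
label_dict = ["B cell", "Monocyte", "T cell"]
probabilities = [
    [0.05, 0.05, 0.90],  # confident T cell
    [0.40, 0.35, 0.25],  # no class reaches 0.5
    [0.10, 0.80, 0.10],  # confident Monocyte
]

print(assign_cell_types(probabilities, label_dict))
# ['T cell', 'Unassigned', 'Monocyte']
```

Lowering the threshold makes the model commit to a known type more often, while raising it flags more cells as potentially novel.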
  • Expected output

The expected output of model inference is the cell type of each individual cell.

  • Guidance for hyperparameter selection

You can select the hyperparameters of the Performer encoder based on your data and task in:

model = PerformerLM(
    num_tokens = 7,
    dim = 200,
    depth = 6,
    heads = 10
)
| Hyperparameter | Description | Default | Arbitrary range |
| --- | --- | --- | --- |
| num_tokens | Number of bins in expression embedding | 7 | [5, 7, 9] |
| dim | Size of scBERT embedding vector | 200 | [100, 200] |
| heads | Number of attention heads of Performer | 10 | [8, 10, 20] |
| depth | Number of Performer encoder layers | 6 | [4, 6, 8] |
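If you want to try several settings systematically, the candidate values listed above can be enumerated as a grid. The sketch below only generates the combinations; the dictionary layout and the iter_configs helper are hypothetical names for this example, and how each configuration is trained and scored is left to you.

```python
from itertools import product

# Candidate values copied from the hyperparameter table.
search_space = {
    "num_tokens": [5, 7, 9],
    "dim": [100, 200],
    "heads": [8, 10, 20],
    "depth": [4, 6, 8],
}

# Defaults from the same table.
defaults = {"num_tokens": 7, "dim": 200, "heads": 10, "depth": 6}


def iter_configs(space):
    """Yield one dict per combination of candidate values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 54
```

Each dictionary holds values for the four arguments shown in the PerformerLM call. With 54 combinations and a demo run of several hours each, a full grid is expensive, so starting from the defaults and varying one hyperparameter at a time may be more practical.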

Time cost

Typical install time on a "normal" desktop computer is about 30 minutes.

Expected run time for inferring 10,000 cells on a "normal" desktop computer is about 25 minutes.

Disclaimer

This tool is for research purposes only and is not approved for clinical use.

This is not an official Tencent product.

Copyright

This tool was developed at Tencent AI Lab.

The copyright holder for this project is Tencent AI Lab.

All rights reserved.

Citation

Yang, F., Wang, W., Wang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell (2022). https://doi.org/10.1038/s42256-022-00534-z
