Note 🚧 This repository is under construction. This note will disappear as soon as all the single-cell transformer paper tables are added.
This repository accompanies Transformers in Single-Cell Omics: A Review and New Perspectives. Please refer to the manuscript for the details.
We provide a curated list of single-cell transformers and their evaluation results. We skip models that work only on bulk data or slide images, and those where transformers are used only as a part of the model. Models focusing on sequential data, such as DNA or protein sequences, are omitted too. New entries are added at the top of the corresponding table.
We welcome contributions to this repository. Please open a pull request or an issue if you want to add or edit an entry.
Model | Paper | Code | Omic Modalities | Pre-training Dataset | Input Embedding | Architecture | SSL Tasks | Supervised Tasks | Zero-shot Tasks |
---|---|---|---|---|---|---|---|---|---|
scMulan | ๐Bian et al. 2024 | ๐GitHub | scRNA-Seq | 10M / cross-tissue, human (hECA) | Not specified | Decoder | Conditional cell generation | cell type annotation, cell metadata annotation (both also used in training) | Batch integration |
BioFormers | ๐Belgadi and Li et al. 2023 | None | scRNA-Seq | 8K / single tissue, human (PBMC, Adamson et al. 2016) | Value categorization: value binning | Encoder | MLM with CE loss | None | Cell clustering, gene expression imputation, genetic perturbation effect prediction, GRN inference |
Geneformer | ๐(Nature)Theodoris et al. 2023 | ๐ ๐ค | scRNA-Seq | 36M / cross-tissue, human (Genecorpus) | Ordering: rank-based | Encoder | MLM with CE loss, gene ID prediction | Gene function prediction, cell annotation | Cell clustering, GRN inference |
Universal Cell Embedding | ๐Rosen et al. 2023 | ๐GitHub | scRNA-Seq | 36M / cross-tissue, cross-species (CELLxGENE and other) | Other: ESM-2 based gene embeddings. Gene embeddings are sampled according to expression levels and order determined by position on chromosomes. | Encoder | Modified MLM, binary CE loss predicting whether a gene is expressed or not. Uses CLS embedding instead of token-embeddings. | Cell annotation | Cell clustering, cross-species integration |
scGPT | ๐(Nature Meth)Cui et al. 2024 | ๐GitHub | scRNA-Seq, scATAC-Seq, CITE-Seq, Spatial transcriptomics | 33M / cross-tissue, human, non-disease (CELLxGENE) | Value categorization: value binning | Other: attention masking in encoder | Iterative MLM variant with MSE loss, cell token expression prediction, gene expression prediction | Cell type annotation, genetic perturbation effect prediction, reverse perturbation prediction, cell clustering, multimodal embedding, gene function prediction | Cell clustering, GRN inference, simulation, gene expression imputation |
TOSICA | ๐(Nature Comms)Chen et al. 2023 | 🛠️GitHub | scRNA-Seq | None | Value projection | Encoder | None | Cell type annotation | None |
scMoFormer | ๐(ACM)Tang et al. 2023 | 🛠️GitHub | scRNA-Seq, scATAC-Seq, CITE-Seq | None | Other, SVD-based | Encoder and graph transformers | None | Cross-modality prediction | None |
tGPT | ๐(Cell iScience)Shen et al. 2023 | 🛠️GitHub | scRNA-Seq | 22M / cross-tissue, cross-species, disease and non-disease, organoids (list) | Ordering | Decoder | NTP with CE loss, gene ID prediction | None | Cell clustering, trajectory inference |
SpaFormer | ๐Wen et al. 2023 | 🛠️GitHub | Spatial transcriptomics | None | Cells as tokens, value projection | Encoder | Modified MLM with MSE loss, gene expression prediction | Gene expression imputation | Cell clustering |
scFoundation | ๐Hao et al. 2023 and Gong et al. 2023 | ๐GitHub | scRNA-Seq | 50M / cross-tissue, human, disease and non-disease (GEO, Single Cell Portal, HCA, EMBL-EBI) | Value projection | Other: two encoders | Modified MLM with MSE loss, gene expression prediction | Drug response prediction, genetic perturbation effect prediction | Read depth enhancement, cell clustering |
CellLM | ๐Zhao et al. 2023 | ๐GitHub | scRNA-Seq | 1.8M / cross-tissue, human, disease and non-disease (PanglaoDB, CancerSCEM) | Value categorization | Encoder | Contrastive loss, MLM with CE loss | Non-disease vs cancer prediction, cell type annotation, drug response prediction | None |
scCLIP | ๐Xiong et al. 2023 | 🛠️GitHub | scRNA-Seq, scATAC-Seq | 377k / cross-tissue, human fetal (ATAC, RNA) | Value projection | Encoder | Contrastive loss, CE matching modalities | None | Multimodal embedding |
GeneCompass | ๐Yang et al. 2023 | GitHub, no code yet | scRNA-Seq | 126M / cross-tissue, human and mouse, disease and non-disease (GEO, SRA, CELLxGENE, GSA, Single Cell Portal, HCA, EMBL-EBI, 3CA, Cell BLAST, TEDD, and other) | ? | Other: two encoders | MLM with CE and MSE loss, gene ID and expression prediction | Cell type annotation, drug response prediction, gene function prediction | Cross-species integration, genetic perturbation effect prediction, GRN inference |
CellPLM | ๐(ICLR)Wen et al. 2023 | Partial ๐GitHub | scRNA-Seq, Spatial transcriptomics | 11M / cross-tissue, human, disease and non-disease (HTCA, HCA, GEO) | Cells as tokens, value projection | Encoder | Modified MLM with MSE loss and KL losses, gene expression prediction | Gene expression imputation, cell type annotation, genetic perturbation effect prediction | Cell clustering, scRNA-Seq denoising |
scMAE | ๐Kim et al. 2023 | None | single-cell flow cytometry | 6.5M / human, disease and non-disease (source?) | Other, concatenation of values with learnable protein embeddings | Other: two encoders | MLM with MSE loss, protein expression prediction | Cell type annotation, protein expression imputation | None |
CAN/CGRAN | ๐Wang et al. 2023 | None | scRNA-Seq | None | Value projection | Encoder | None | Cell type annotation | None |
scTranslator | ๐Liu et al. 2023 | 🛠️GitHub | scRNA-Seq, CITE-Seq | None | Value projection | Other: two encoders | None | Cross-modality prediction | (After cross-modality prediction training) GRN inference, cell clustering |
scTransSort | ๐(MDPI)Jiao et al. 2023 | 🛠️GitHub | scRNA-Seq | None | Value projection | Encoder | None | Cell type annotation | None |
STGRNS | ๐(OUP)Xu et al. 2023 | 🛠️GitHub | scRNA-Seq | None | Other | Encoder | None | GRN inference | None |
CIForm | ๐(OUP)Xu et al. 2023 | 🛠️GitHub | scRNA-Seq | None | Value projection | Encoder | None | Cell type annotation | None |
scFormer | ๐Cui et al. 2023 | Incomplete GitHub | scRNA-Seq | Task specific | Value categorization: value binning | Encoder | Modified MLM with CE, cell token expression prediction, contrastive loss with cosine similarity, gene expression prediction | Cell type annotation, genetic perturbation effect prediction | Cell clustering |
Exceiver | ๐Connell et al. 2022 | 🛠️GitHub | scRNA-Seq | 0.5M / cross-tissue, human (Tabula Sapiens) | Other: value scaled embeddings | Encoder | Modified MLM with MSE, gene expression prediction | Cell type annotation, drug response prediction | Cell clustering |
TransCluster | ๐(Frontiers)Song et al. 2022 | 🛠️GitHub | scRNA-Seq | None | Value projection with LDA | Encoder | None | Cell type annotation | None |
scBERT | ๐(Nature MI)Yang et al. 2022 | ๐GitHub | scRNA-Seq | 1M / cross-tissue, human (PanglaoDB) | Value categorization, binning | Encoder | MLM with CE loss, gene expression prediction | Cell type annotation, unseen cell type detection | None |
iSEEEK | ๐(OUP)Shen et al. 2022 | ๐GitHub (dataset not public) | scRNA-Seq | 11.9M / cross-tissue, cross-species (list) | Ordering: rank-based | Encoder | MLM with CE loss | Marker gene classification | Cell clustering, pseudotime analysis, GRN inference |
Multitask learning | ๐Pang et al. 2020 | None | scRNA-Seq | 160k / brain, mouse (MBA) | Value projection | Other: autoencoder with two transformer encoders (?) | Modified MLM with MSE loss, gene expression prediction | None | Cell clustering |
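
The "Input Embedding" column above groups models by how they turn a cell's expression vector into transformer input. As a rough illustration only, and not the implementation of any listed model, the three most common categories (ordering, value categorization, value projection) might be sketched in NumPy as follows. The function names, the per-cell quantile bin edges, and the random projection matrix are assumptions made for this example.

```python
import numpy as np


def rank_order_tokens(expr, gene_ids):
    """Ordering: expressed genes sorted from highest to lowest expression."""
    expressed = np.flatnonzero(expr > 0)
    # Stable sort keeps the original gene order among ties.
    order = expressed[np.argsort(-expr[expressed], kind="stable")]
    return [gene_ids[i] for i in order]


def bin_values(expr, n_bins=5):
    """Value categorization: zeros stay 0, non-zero values get bins 1..n_bins."""
    binned = np.zeros(expr.shape, dtype=int)
    nonzero = expr > 0
    if nonzero.any():
        # Assumption: per-cell quantile bin edges over the non-zero values.
        inner_quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
        edges = np.quantile(expr[nonzero], inner_quantiles)
        binned[nonzero] = np.digitize(expr[nonzero], edges) + 1
    return binned


def project_values(expr, weight):
    """Value projection: linear map of the expression vector to an embedding."""
    return expr @ weight


expr = np.array([0.0, 5.0, 1.0, 3.0])  # toy expression values for one cell
gene_ids = ["A", "B", "C", "D"]        # hypothetical gene identifiers
rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 8))       # stand-in for a learned projection

tokens = rank_order_tokens(expr, gene_ids)  # gene IDs, highest expression first
bins = bin_values(expr, n_bins=3)           # integer bin per gene
embedding = project_values(expr, weight)    # dense vector of length 8
```

Individual models differ in the details (for example, how values are normalized before ranking or binning), so the corresponding papers remain the reference for any specific method.
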
Model | Paper | Code | Omic Modalities | Pre-training Dataset | Input Embedding | Architecture | SSL Tasks | Supervised Tasks | Zero-shot Tasks |
---|---|---|---|---|---|---|---|---|---|
scInterpreter | ๐Li et al. 2024 | None | scRNA-Seq | Natural Language GPT-3.5 and Llama-13b | Other: Ordering with embedding of the natural language representation | Decoder, GPT-3.5 and Llama-13b | NTP with CE loss and instruction finetuning (GPT-3.5 closed-source) | None | Cell type annotation (LLMs frozen, only small MLP trained) |
ChatCell | ๐โFang et al. 2024 | ๐ GitHub | scRNA-Seq | Natural Language T5 and natural language instructions | Other: Ordering with embedding as natural language with additional terms | Encoder-Decoder, T5 | NTP with CE loss | None (conditional sequence generation, prompting) | Simulation, cell type annotation, drug sensitivity prediction |
MarkerGeneBERT | ๐Cheng et al. 2023 | None | scRNA-Seq | Natural Language, PubMed and PubMed Central | Other: Natural language preprocessed with SciBERT | Encoder | MLM | Named Entity Recognition (NER), cell-biomarker sentence classification | None |
scELMo | ๐Liu, Chen and Zheng 2023 | Partial ๐GitHub | scRNA-Seq, CITE-Seq | Natural Language, Closed source | Other: NLP model embeddings of features weighted by the feature level in a cell (e.g. expression level) | Closed source (some open) | Closed source (some open) | Cell type annotation, Genetic perturbation effect prediction | Cell and gene embeddings in other perturbation models |
GenePT | ๐Chen and Zou 2023 | Partial ๐GitHub | scRNA-Seq | Natural Language, Closed source | Ordering: embedding as natural language | Closed source | Closed source | Gene function prediction | Cell clustering, GRN inference |
GPT-4 | ๐Z. Ji and Hou 2023 | None | scRNA-Seq | Natural Language, Closed source | Ordering: embedding as natural language | Closed source | Closed source | None (conditional sequence generation, prompting) | Cell type annotation |
Cell2Sentence | ๐Levine et al. 2023 | 🛠️GitHub | scRNA-Seq | Natural Language (GPT2) and scRNA-Seq (40k / immune, human) | Ordering: embedding as natural language | Decoder | NTP with CE loss | None | Simulation, cell type annotation |
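
Several entries above list "Ordering: embedding as natural language" as their input embedding. A minimal sketch of that idea, assuming a cell is given as a mapping from gene name to expression value, is to write the expressed gene names in order of decreasing expression so that a language model can read the cell as text. The function name, the alphabetical tie-breaking, and the `top_k` cutoff are hypothetical choices for this example and are not taken from any of the listed papers.

```python
def cell_to_sentence(expression, top_k=None):
    """Return expressed gene names as one string, highest expression first."""
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by decreasing value; break ties alphabetically so output is deterministic.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    names = [gene for gene, _ in expressed]
    if top_k is not None:
        names = names[:top_k]
    return " ".join(names)


# Toy cell with made-up counts; unexpressed genes are dropped.
cell = {"CD3E": 7, "GAPDH": 12, "MS4A1": 0, "ACTB": 12}
sentence = cell_to_sentence(cell, top_k=3)
```

The resulting string can then be tokenized by a standard text tokenizer, which is what lets these approaches reuse pretrained language models.
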
Paper | Code | Omic Modalities | Evaluated Transformers | Tasks | Notes |
---|---|---|---|---|---|
๐He et al. 2024 | 🛠️GitHub | scRNA-Seq | scGPT | Cell type annotation | Evaluation of Parameter-Efficient Fine-Tuning (PEFT) for scGPT. Indicates that PEFT is not only more compute-efficient but also yields better cell type prediction. |
๐(Nature MI)Khan et al. 2023 | 🛠️GitHub | scRNA-Seq | scBERT | Cell type annotation, unseen cell type detection | Focused on imbalanced cell type classification. scBERT is sensitive to class imbalance. scBERT outperforms Seurat. scBERT doesn't perform well in unseen cell type detection. It benefits from SSL pretraining. |
๐Liu et al. 2023 | 🛠️GitHub | scRNA-Seq, scATAC-Seq, Spatial transcriptomics | scGPT, Geneformer, scBERT, tGPT, CellLM | Cell clustering, cell type annotation, multimodal embedding, GRN inference, gene expression imputation, genetic perturbation effect prediction, simulation, gene function prediction | Models aren't trained on the same datasets. scGPT is positioned as the most versatile in terms of the diversity of tasks it can tackle. Non-transformer models appear to be at least as good as transformers in most tasks. Transformers were shown to be sensitive to the choice of hyperparameters, such as learning rate and number of epochs. |
๐Boiarsky et al. 2023 | 🛠️GitHub | scRNA-Seq | scBERT, scGPT | Cell type annotation | Logistic regression appears to be as good as transformers in cell type annotation, even in low-data scenarios. |
๐Kedzierska et al. 2023 | 🛠️GitHub | scRNA-Seq | scGPT, Geneformer | Cell clustering | Zero-shot performance only. Both models appear unreliable. |
๐Alsabbagh et al. 2023 | 🛠️GitHub | scRNA-Seq | scGPT, Geneformer, scBERT | Cell type annotation | Focused on imbalanced cell type classification. Geneformer appears to be outperformed by scGPT and scBERT, which perform similarly to each other. |
- ๐ - Preprint
- ๐ - Peer-Reviewed Publication
- 🛠️ - Fully reproducible
- ๐ - Code for evaluation only
- โ - Retracted or withdrawn
If you find the data in this repository useful for your work, please cite:
@Article{TBA}