theislab / atlas-feature-selection-benchmark
Code for benchmarking the effect of feature selection on scRNA-seq atlas construction and use
License: MIT License
Promising:
Not sooo promising:
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
Normalised Mutual Information (NMI) measures the similarity between two sets of labels (in this case the ground truth cell labels and cluster assignments). It was used as part of the scIB project.
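As a rough illustration of what NMI computes, here is a minimal pure-Python sketch. It assumes arithmetic-mean normalisation of the two label entropies; the function names are hypothetical and this is not the scIB implementation.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI normalised by the arithmetic mean of the entropies (an assumption)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        # Both assignments put every cell in one group: treat as identical
        return 1.0
    return 2 * mutual_information(a, b) / (h_a + h_b)


cell_labels = ["T", "T", "B", "B"]
matching_clusters = [0, 0, 1, 1]
unrelated_clusters = [0, 1, 0, 1]
```

A clustering that reproduces the labels scores 1 and one that is independent of them scores 0.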
Anything else you think is important about the metric
Thanks for taking the time to suggest a new dataset!
Please briefly describe the suggested dataset: What is in the dataset (tissue, technology, number of cells etc.)? What kind of batches does it have? What kind of cell annotations does it have? Why would it be a good fit for the project?
The Human Endoderm Atlas [HEA] is a reference atlas of multiple endodermal organs from human development. It contains 34 samples from 14 individuals across 15 tissues from six organs. High quality labels with two hierarchical levels (major cell type, 7; cell type, 27) are available.
Links to information about the dataset
Anything else you think is important about the dataset
Thanks for taking the time to suggest a new feature!
Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?
Update scIB environments to the latest release once changes have been merged
Anything else you think is important about the feature
Required to avoid workarounds in:
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The Local Inverse Simpson’s Index (LISI) measures diversity in the neighbourhood of a cell. The cell-type variant (cLISI) gives better scores when the neighbourhood consists of the same labels as the target cell. It was used as part of the scIB project where a more flexible graph-based implementation was developed.
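The quantity underneath LISI can be sketched as the inverse Simpson's index of the labels in one neighbourhood. This simplified sketch assumes unweighted neighbours, whereas LISI itself weights neighbours by distance; the function name is hypothetical.

```python
from collections import Counter


def inverse_simpson(neighbour_labels):
    """Effective number of distinct labels in a neighbourhood.

    Simplification (assumption): every neighbour has equal weight.
    """
    n = len(neighbour_labels)
    return 1.0 / sum((c / n) ** 2 for c in Counter(neighbour_labels).values())


pure_neighbourhood = ["T cell"] * 4
mixed_neighbourhood = ["T cell", "T cell", "B cell", "B cell"]
```

A neighbourhood with a single label gives 1, which is the desirable case for cLISI, while an even mix of two labels gives 2.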
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The k-Nearest Neighbour Batch effect Test (kBET) uses a statistical test to measure the mixing of batches and labels within the neighbourhood of a cell. It was used as part of the scIB project.
Anything else you think is important about the metric
Thanks for taking the time to suggest a new dataset!
Please briefly describe the suggested dataset: What is in the dataset (tissue, technology, number of cells etc.)? What kind of batches does it have? What kind of cell annotations does it have? Why would it be a good fit for the project?
Open Problems in Single-Cell Analysis produced a dataset for a NeurIPS 2021 competition. This contains multiomics samples from several individuals, produced at different sequencing facilities and with consensus labels.
Anything else you think is important about the dataset
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The Adjusted Rand Index (ARI) measures the similarity between two sets of labels (in this case the ground truth label and a clustering). It adjusts for the similarity that would be expected by chance depending on the size of the clusters. It was used as part of the scIB project.
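The chance adjustment can be made concrete with a small pure-Python sketch based on pair counts from the contingency table. It assumes at least two cells; the function name is hypothetical and this is not the scIB code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, clusters):
    """ARI from pair counts; assumes len(truth) == len(clusters) >= 2."""
    n = len(truth)
    # Pairs of cells that share both a label and a cluster
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, clusters)).values())
    # Pairs that share a label, and pairs that share a cluster
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(clusters).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


labels = ["a", "a", "b", "b"]
perfect = [0, 0, 1, 1]
crossed = [0, 1, 0, 1]
```

Identical partitions score 1, and a clustering that agrees less than chance would predict scores below 0.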
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The F1 score is a commonly used evaluation metric that measures classification performance as the harmonic mean of precision and recall.
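For a single label treated as the positive class, the calculation can be sketched as follows (the function name and the example labels are hypothetical):

```python
def f1_score(truth, predicted, positive):
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = ["T", "T", "B", "B"]
predicted = ["T", "B", "B", "B"]
score = f1_score(truth, predicted, positive="T")
```

Here precision is 1 and recall is 0.5, so the score is 2/3.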
Links to information about the metric
Anything else you think is important about the metric
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
triku is a graph-based feature selection method that selects features that show an unexpected number of zero counts and whose expression is located in cells that have similar expression profiles.
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The Average Silhouette Width (ASW) measures the compactness of clusters in a dataset. For the scIB project a modified version was developed which evaluates how spread (and therefore well integrated) batches are.
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The graph connectivity score measures how connected the subgraphs for each label are. For a well-integrated dataset it is expected that cells with the same label will be well connected while for a poorly integrated dataset they will be more disconnected. It was used as part of the scIB project.
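One way to read this score is as the mean, over labels, of the fraction of cells with that label that sit in the largest connected component of the label's subgraph. The sketch below assumes the neighbourhood graph is given as an undirected edge list over cell indices; the function names are hypothetical.

```python
from collections import deque


def largest_component_fraction(nodes, edges):
    """Fraction of `nodes` in the largest connected component of their subgraph."""
    nodes = set(nodes)
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:  # keep only edges within this label
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, largest = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        largest = max(largest, size)
    return largest / len(nodes)


def graph_connectivity(labels, edges):
    """Mean largest-component fraction across labels."""
    cells_by_label = {}
    for cell, label in enumerate(labels):
        cells_by_label.setdefault(label, []).append(cell)
    fractions = [
        largest_component_fraction(cells, edges)
        for cells in cells_by_label.values()
    ]
    return sum(fractions) / len(fractions)


labels = ["a", "a", "a", "b", "b"]
edges = [(0, 1), (3, 4), (2, 3)]
score = graph_connectivity(labels, edges)
```

In this toy graph one "a" cell is only connected to a "b" cell, so label "a" contributes 2/3 and label "b" contributes 1.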
Anything else you think is important about the metric
Currently, genes with 0 counts are removed before the dataset is split into reference and query. This guarantees that both have the same feature set, but either of them could still contain genes with 0 counts after the split. Consider whether filtering should be done on each separately and the intersection of the remaining genes used.
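The alternative could look roughly like this. The sketch assumes counts are held as a plain mapping from gene name to per-cell counts, which is a stand-in for the real objects used in the pipeline; the function names are hypothetical.

```python
def expressed_genes(counts):
    """Genes with at least one non-zero count.

    `counts` maps gene name -> list of per-cell counts (an assumed layout).
    """
    return {gene for gene, values in counts.items() if sum(values) > 0}


def shared_expressed_genes(reference, query):
    """Filter reference and query separately, then keep the intersection."""
    return sorted(expressed_genes(reference) & expressed_genes(query))


reference = {"A": [1, 0], "B": [0, 0], "C": [2, 3]}
query = {"A": [0, 1], "B": [4, 0], "C": [0, 0]}
shared = shared_expressed_genes(reference, query)
```

Only gene "A" is expressed in both splits, so it is the only gene kept.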
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
SCMER is a feature selection method designed for single-cell data analysis. It selects a compact set of markers that preserve the manifold in the original data. It can also be used for multimodal data integration by using features in one modality to match the manifold of another modality.
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new feature!
Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?
Adapt pipeline to pass reference and query integrated objects to metrics. This is required by some of the mapping/unseen population metrics.
Anything else you think is important about the feature
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
Add extensions to existing integration metrics based on clustering that take into account the imbalance in ground truth labels.
Links to information about the metric
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
Describes how well the local structure of each group prior to integration is preserved after integration
Links to information about the metric
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The isolated labels score measures how well labels that were present in few samples can be distinguished in the integrated dataset. It can be calculated using either a clustering-based approach with the F1 score or unsupervised ASW. It was used as part of the scIB project.
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The cell cycle conservation score measures how much of the variance associated with the cell cycle in individual batches remains after integration. It was used as part of the scIB project.
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
LISI between the reference and the query.
Links to information about the metric
Anything else you think is important about the metric
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
Select features based on high Pearson residuals
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Can probably be implemented by modifying the existing scanpy script
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
M3Drop fits a Michaelis-Menten model to the pattern of dropouts in single-cell RNASeq data. This model is used as a null to identify significantly variable (i.e. differentially expressed) genes for use in downstream analysis.
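The null model can be sketched as a Michaelis-Menten curve relating a gene's mean expression to its expected dropout rate. The sketch assumes the constant K has already been fitted; the function names are hypothetical and the significance test is not shown.

```python
def expected_dropout_rate(mean_expression, k):
    """Michaelis-Menten expectation: P(dropout) = 1 - S / (K + S)."""
    return 1.0 - mean_expression / (k + mean_expression)


def excess_dropout(observed_rate, mean_expression, k):
    """How far a gene's observed dropout rate sits above the fitted curve."""
    return observed_rate - expected_dropout_rate(mean_expression, k)


# Illustrative values only; k would come from fitting the curve to all genes
half_point = expected_dropout_rate(mean_expression=5.0, k=5.0)
```

A gene whose mean expression equals K is expected to drop out in half of the cells, and genes with more dropouts than expected for their mean are the candidates for selection.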
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new dataset!
Please briefly describe the suggested dataset: What is in the dataset (tissue, technology, number of cells etc.)? What kind of batches does it have? What kind of cell annotations does it have? Why would it be a good fit for the project?
The Human Lung Cell Atlas is a comprehensive catalogue of cells in the human lung. It contains samples from hundreds of individuals and high-quality consensus cell labels at different hierarchical levels.
Anything else you think is important about the dataset
In the methods workflow line 169, "method-random" instead of "method-triku"
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
Hotspot is a graph-based method that selects features that are associated with similarity between cells (represented as a graph).
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
DUBStepR (Determining the Underlying Basis using Step-wise Regression) is a feature selection algorithm for cell type identification in single-cell RNA-sequencing data. It is based on the intuition that cell-type-specific marker genes tend to be well correlated with each other, i.e. they typically have strong positive and negative correlations with other marker genes.
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The Average Silhouette Width (ASW) measures the compactness of clusters in a dataset. By calculating it on cell labels we evaluate whether cells of the same type are nearby or separated. It was used as part of the scIB project.
Anything else you think is important about the metric
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
{Seurat} is the most commonly used R toolbox. It contains a highly variable gene feature selection function that selects features by either performing a variance stabilising transformation and selecting variable features ("vst"), binning features by expression and selecting over-dispersed features ("mean.var.plot") or simply selecting the features with highest dispersion values ("dispersion").
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
Selects features based on those that have an excess of negative correlations
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
singleCellHaystack is a package for predicting differentially expressed genes (DEGs) in single-cell transcriptome data without the use of cell labels. It uses Kullback-Leibler Divergence to find genes that are expressed in subsets of cells that are non-randomly positioned in a reduced dimensional space.
Please describe any variants of the method (different parameters, number of selected features etc.), if any
use.advanced.sampling
mode
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
Correlation between the KNN of query cells before and after mapping.
Links to information about the metric
Anything else you think is important about the metric
May be possible to reuse the Symphony implementation (preferred) but if not should not be too difficult to reimplement
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
The single-cell Stably Expressed Gene (scSEG) index, available in the {scMerge} package, measures how stably expressed a gene is across a dataset and can be used to select stably expressed genes. This is the opposite of typical methods, which look for highly variable genes, so it would serve as a negative control for the benchmark.
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The Jaccard index, or Jaccard similarity coefficient, is defined as the size of the intersection divided by the size of the union of two label sets.
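As a minimal sketch of the definition (the function name and the convention for two empty sets are assumptions):

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        # Convention chosen here (assumption): two empty sets are identical
        return 1.0
    return len(a & b) / len(a | b)


similarity = jaccard_index({"T cell", "B cell"}, {"B cell", "NK cell"})
```

The two sets share one of three distinct labels, so the similarity is 1/3.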
Links to information about the metric
Anything else you think is important about the metric
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
The single-cell Projective Non-negative Matrix Factorization (scPNMF) method performs a dimensionality reduction and then filters bases to find those that show evidence of multimodal structure. Features can then be selected based on those bases.
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
scanpy is the most commonly used Python toolbox and contains three methods for selecting highly variable genes: "seurat" (default), "cell_ranger" and "seurat_v3". For "seurat" and "cell_ranger" genes are binned by mean expression and normalised dispersions calculated per bin. "seurat" uses thresholds to select features while "cell_ranger" uses a target number of genes.
"seurat_v3" starts with raw counts (rather than log-normalised values), applies a variance stabilising transformation and ranks genes by normalised variance.
More details on methods in the scanpy documentation.
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Anything else you think is important about the method
Thanks for taking the time to suggest a new feature!
Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?
The pipeline should keep track of the species for each dataset and pass that information as needed (particularly to metrics). Required for the cell cycle conservation score (#18).
Anything else you think is important about the feature
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The PCA Regression (PCR) comparison measures the amount of variance in the dataset explained by the batch label before and after integration. If the integration performs well, more variance should be explained by batch before integration than after. It was used as part of the scIB project.
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
Classification metrics that are calculated per label (such as F1 score) can be averaged in various ways. The suggestion would be to weight averages by the rarity of the labels to focus on correct classifications of uncommon labels.
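One possible weighting, sketched here as an assumption rather than a settled design, is to weight each label's score by the inverse of its cell count. The function name is hypothetical.

```python
def rarity_weighted_mean(scores, counts):
    """Average per-label scores with weights proportional to 1 / label count.

    Inverse-frequency weighting is an assumption; other rarity weights
    could be substituted.
    """
    weights = {label: 1.0 / counts[label] for label in scores}
    total = sum(weights.values())
    return sum(scores[label] * weights[label] for label in scores) / total


per_label_f1 = {"common": 1.0, "rare": 0.0}
label_counts = {"common": 90, "rare": 10}
weighted = rarity_weighted_mean(per_label_f1, label_counts)
```

A plain mean of these two scores would be 0.5, while the rarity-weighted mean drops to 0.1 because the misclassified label is the uncommon one.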
Links to information about the metric
Anything else you think is important about the metric
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
The Orchestrating Single-Cell Analysis with Bioconductor book describes how to use core Bioconductor packages to analyse single-cell data. It proposes a method for feature selection that considers batches in the data and selects features with additional biological variation.
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Anything else you think is important about the method
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
Distance between an average reconstructed cell and real query cells
Links to information about the metric
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
Use the MILO method to identify cells with enriched query neighbourhoods in unseen cell labels.
Links to information about the metric
Anything else you think is important about the metric
Might need some kind of summarisation to get an overall score.
Thanks for taking the time to fill out this bug report!
Please use Markdown formatting for any code snippets.
What happened?
Briefly describe the issue
On the small test simulation datasets the NBumi method selects zero genes, which breaks the integration stage of the pipeline.
What were you doing?
Briefly describe what led to the issue. A reproducible example or other code snippets are great
What did you see?
Include any error messages, log output or other output that could help diagnose the problem
Subsetting to 0 selected features...
Setting up AnnData for scVI...
...
IndexError: index 172 is out of bounds for size 0
Proposed solution
If you have a suggestion for how to solve the issue we would love to hear it!
Options:
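One option, sketched here under assumptions rather than taken from the pipeline, would be to check the number of selected features before integration and fail with a clear message. The function name and the minimum threshold are hypothetical.

```python
def check_selected_features(selected, method, min_features=1):
    """Fail early if a selection method returned too few features.

    `min_features` is a hypothetical threshold, not a pipeline setting.
    """
    if len(selected) < min_features:
        raise ValueError(
            f"Method '{method}' selected {len(selected)} features; "
            f"at least {min_features} required for integration"
        )
    return selected
```

This would replace the IndexError raised deep inside the integration stage with an error that names the method responsible.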
Your environment
Please include the information relevant to your issue
HMGU server
Anything else?
Anything else you want to tell us about the issue
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
Select features based on simple statistical values
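As one example of such a statistic, the sketch below ranks genes by population variance and keeps the top n. The choice of variance, the data layout and the function name are all assumptions made for illustration.

```python
from statistics import pvariance


def top_variance_genes(counts, n):
    """Return the n genes with the highest variance across cells.

    `counts` maps gene name -> list of per-cell values (an assumed layout).
    Ties are broken alphabetically so the result is deterministic.
    """
    variances = {gene: pvariance(values) for gene, values in counts.items()}
    return sorted(variances, key=lambda gene: (-variances[gene], gene))[:n]


counts = {"A": [0, 0, 0, 0], "B": [0, 10, 0, 10], "C": [1, 2, 1, 2]}
selected = top_variance_genes(counts, n=2)
```

Swapping `pvariance` for another summary, such as the mean, would give a different variant of the same method.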
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
Select features by taking the top marker genes for each label as detected using the Wilcoxon rank sum test
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new method!
Selects features based on high deviance from a constant multinomial model
Links to information about the method
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The Matthews Correlation Coefficient (MCC) measures the quality of binary and multiclass classifications and is regarded as a balanced measure that can be used even if the classes are of very different sizes.
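For the binary case, the coefficient can be sketched directly from the confusion matrix counts. The function name is hypothetical, and returning 0 when the denominator is zero is a convention assumed here.

```python
from math import sqrt


def matthews_corrcoef_binary(tp, tn, fp, fn):
    """Binary MCC from confusion matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Undefined when a row or column of the matrix is empty; use 0 (assumption)
        return 0.0
    return (tp * tn - fp * fn) / denominator


perfect = matthews_corrcoef_binary(tp=5, tn=5, fp=0, fn=0)
inverted = matthews_corrcoef_binary(tp=0, tn=0, fp=5, fn=5)
always_positive = matthews_corrcoef_binary(tp=90, tn=0, fp=10, fn=0)
```

A classifier that always predicts the majority class reaches 90% accuracy in the last example but gets an MCC of 0, which is why the measure is useful when classes differ in size.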
Anything else you think is important about the metric
Thanks for taking the time to suggest a new feature!
Please briefly describe the suggested feature: What is it? How would it work? Why would it be a good fit for the project?
Update all R environments to use R 4.2 and Bioconductor 3.16
Anything else you think is important about the feature
Thanks for taking the time to suggest a new method!
Please briefly describe the suggested method: What is it? How does it work? Why would it be a good fit for the project?
Select features based on excess coefficient of variation
Please describe any variants of the method (different parameters, number of selected features etc.), if any
Links to information about the method
Anything else you think is important about the method
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
The Local Inverse Simpson’s Index (LISI) measures diversity in the neighbourhood of a cell. The integration variant (iLISI) gives better scores when the neighbourhood contains a mix of batches rather than only the batch of the target cell. It was used as part of the scIB project where a more flexible graph-based implementation was developed.
Anything else you think is important about the metric
Thanks for taking the time to suggest a new metric!
Please briefly describe the suggested metric: What is it? What does it measure? Why would it be a good fit for the project?
Accuracy Loss of Cell type Self-projection: the difference in accuracy between classifiers trained on the query only (per batch) and on the query in the reference space.
Links to information about the metric
Anything else you think is important about the metric
Not sure that the exact metric is in the package or if it needs some wrapping code