Is there any summary statistic in Bluster which is known to be robust to the null data

Robustness to Null Dataset Problem about bluster HOT 2 CLOSED

DarioS commented on August 31, 2024

Robustness to Null Dataset Problem

from bluster.

Comments (2)

LTLA commented on August 31, 2024

Many papers have been written about detection of overclustering, so it's a pretty well-studied problem. I daresay that most of these papers miss the mark, though, because they don't consider the real scientific question.

tl;dr A homogeneous cell type can still have interesting subclusters.

Consider a cell type whose members follow a multivariate normal (MVN) distribution in the expression space (or PC space, or whatever space you care to think of). I think we could both agree that this could be described as "homogeneous": there aren't any clear subclusters, and the density falls off smoothly in every direction. However, I would argue that the structure inside this cluster could very well be biologically interesting if, say, an axis of significant variation was associated with some relevant pathway. In such cases, it would make sense to at least try to subcluster and see what you find. If you stop at "oh it's homogeneous", you would never be able to interrogate the heterogeneity within each cell type.

(One could say that it would be better to use trajectory inference for these continuous changes. This is fair enough but it's sometimes hard to figure out when to switch from clusters to trajectories if you don't already know it's continuous. So you usually need at least one subclustering step before you decide that it's continuous enough to switch.)
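To make the MVN point concrete, here is a minimal sketch in plain Python (not bluster code; the simulated data, the `two_means` helper and all parameter values are assumptions made up for illustration). It draws a single Gaussian "cell type" that is wider along one axis and runs a basic 2-means on it. The population is homogeneous by construction, yet the split should fall along the axis of largest variation, which is exactly the axis that might carry the interesting biology.

```python
import random


def two_means(points, iters=50):
    """Plain Lloyd's algorithm with k=2 on 2-D points (illustrative helper)."""
    # Start the two centres at the points with the smallest and largest x.
    centres = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centre (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda k: (p[0] - centres[k][0]) ** 2
                + (p[1] - centres[k][1]) ** 2)
            for p in points
        ]
        # Move each centre to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centres[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centres


rng = random.Random(0)
# One homogeneous population: a single Gaussian, sd 3 along x and sd 1 along y.
cells = [(rng.gauss(0, 3), rng.gauss(0, 1)) for _ in range(2000)]

labels, centres = two_means(cells)
sizes = [labels.count(0), labels.count(1)]
gap_x = abs(centres[0][0] - centres[1][0])  # separation along the long axis
gap_y = abs(centres[0][1] - centres[1][1])  # separation along the short axis
print("cluster sizes:", sizes)
print("centre gap along x:", round(gap_x, 2), "along y:", round(gap_y, 2))
```

The two "subclusters" are an arbitrary cut through a continuum, but the genes that differ between them would still describe the dominant axis of variation.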

A long time ago, I tried using some metrics (WCSS, Rand, modularity ratios) to see if I could automatically determine the appropriate number of clusters. I don't remember the exact results but I do remember being disappointed because I ended up with too-broad clusters, as that was the only thing that the various methods were confident in. Moreover, each method had its own tunable parameters and thresholds, so in the end I was just trading one parameter (the number of clusters) for some other parameters without any clear benefit in interpretation.
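As a rough illustration of why a metric like WCSS ends up needing its own threshold, the sketch below (plain Python, not the analysis described above; `kmeans_1d` and the simulated data are hypothetical) runs 1-D k-means on a single Gaussian with no cluster structure at all. The WCSS should still drop at every increase in k, so something has to decide how large a drop counts as "real", and that something is just another parameter.

```python
import random


def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on 1-D data; returns the within-cluster sum of squares."""
    ordered = sorted(values)
    n = len(ordered)
    # Initialise centres at evenly spaced quantiles so no cluster starts empty.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centres[j]) ** 2)
            groups[nearest].append(v)
        # Each centre becomes the mean of its group (kept as-is if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return sum((v - c) ** 2 for g, c in zip(groups, centres) for v in g)


rng = random.Random(1)
# "Null" data: one standard normal population, no subclusters by construction.
null_data = [rng.gauss(0, 1) for _ in range(1000)]

wcss = {k: kmeans_1d(null_data, k) for k in range(1, 6)}
for k, w in wcss.items():
    print(k, round(w, 1))
```

Going from one cluster to two already removes well over half of the WCSS here, even though there is nothing to find.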

I think the fundamental issue is that there isn't a clean mathematical way of expressing that some level of heterogeneity is biologically uninteresting in order to stop the subclustering. I might stop if my subclusters are all related to cell cycle, or metabolic activity, or some other boring thing, but others might get very excited by those same partitions, so who am I to say whether they should use those subclusters? A true "hard limit" of overclustering is when you start dropping below technical variation (e.g., the Poisson noise from sequencing), at which point you can confidently say that you've jumped the shark. But it takes a lot, like a lot, of overclustering to get to that point, so it's mostly a useless threshold.
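One way to picture that Poisson floor, as a sketch only: for Poisson counts the variance equals the mean, so a gene's variance-to-mean ratio (the Fano factor) within a cluster sits near 1 when nothing but sampling noise is left. Using the Fano factor as the diagnostic is an assumption of this example, not something bluster provides, and the rates and cell numbers below are made up.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson variate with Knuth's multiplication method (small means only)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def fano(counts):
    """Variance-to-mean ratio; about 1 for pure Poisson noise."""
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
    return var / mean


rng = random.Random(2)

# Cluster with only technical noise: every cell has the same underlying rate.
noise_only = [poisson(rng, 5.0) for _ in range(5000)]

# Cluster with remaining heterogeneity: cells are in a low or a high state.
mixed = [poisson(rng, rng.choice((2.0, 8.0))) for _ in range(5000)]

fano_noise = fano(noise_only)
fano_mixed = fano(mixed)
print("Fano factor, noise only:", round(fano_noise, 2))
print("Fano factor, two hidden states:", round(fano_mixed, 2))
```

A cluster only reaches the noise-only regime once essentially all cell-to-cell differences in rate are gone, which is why this limit is rarely hit in practice.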

In practice, people will always overcluster to see if there's anything interesting as they keep digging. Which is fine, it's all exploratory anyway, no one's really making quantitative claims here. Nonetheless, if you want to implement this method, I'd suggest making your own package; it seems pretty involved and I don't want to be on the hook to maintain it.


DarioS commented on August 31, 2024

Interesting perspective.
