Comments (2)

LTLA commented on August 31, 2024

Many papers have been written about detection of overclustering, so it's a pretty well-studied problem. I daresay that most of these papers miss the mark, though, because they don't consider the real scientific question.

tl;dr A homogeneous cell type can still have interesting subclusters.

Consider a cell type whose members follow a multivariate normal (MVN) distribution in the expression space (or PC space, or whatever space you care to think of). I think we could both agree that this could be described as "homogeneous" - there aren't any clear subclusters and it's a smooth gradient of density in any direction of travel. However, I would argue that the structure inside this cluster could very well be biologically interesting if, say, an axis of significant variation was associated with some relevant pathway. In such cases, it would make sense to at least try to subcluster and see what you find. If you stop at "oh it's homogeneous", you would never be able to interrogate the heterogeneity within each cell type.
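To make the MVN argument concrete, here is a toy sketch in plain Python (not bluster code, and not anything from the original analysis). It assumes a 2-D Gaussian cloud where axis 0 stands in for a hypothetical pathway-driven axis, and uses a hand-rolled 2-means as the subclustering step. All names and parameters are made up for illustration.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def two_means(points, iters=50, seed=0):
    """Plain Lloyd's algorithm with k=2 on a list of (x, y) tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centre.
        labels = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        # Recompute each centre; keep the old one if a cluster is empty.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels, centers

rng = random.Random(42)
# One unimodal "cell type": axis 0 (sd 3) is the hypothetical pathway axis,
# axis 1 (sd 1) is everything else. There is no gap in density anywhere.
cells = [(rng.gauss(0.0, 3.0), rng.gauss(0.0, 1.0)) for _ in range(2000)]

labels, centers = two_means(cells)
gap_major = abs(centers[0][0] - centers[1][0])  # separation along axis 0
gap_minor = abs(centers[0][1] - centers[1][1])  # separation along axis 1
sizes = (labels.count(0), labels.count(1))
print("cluster sizes:", sizes)
print("centre gap along pathway axis:", round(gap_major, 2))
print("centre gap along other axis:", round(gap_minor, 2))
```

The two subclusters are not "real" in the sense of separate modes, but they do differ systematically along the dominant axis, which is exactly what you would want to see if that axis tracked a pathway of interest.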

(One could say that it would be better to use trajectory inference for these continuous changes. This is fair enough but it's sometimes hard to figure out when to switch from clusters to trajectories if you don't already know it's continuous. So you usually need at least one subclustering step before you decide that it's continuous enough to switch.)

A long time ago, I decided to use some metrics (WCSS, Rand, modularity ratios) to see if I could automatically determine the appropriate number of clusters. I don't remember the exact results but I do remember being disappointed because I ended up with too-broad clusters, as that was the only thing that the various methods were confident in. Moreover, each of the methods had its own tunable parameters and thresholds, so in the end I was just trading one parameter (the number of clusters) for some other parameters without any clear benefit in interpretation.
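The "trading one parameter for another" point can be sketched with WCSS alone. The following is a toy illustration, not a reconstruction of the original experiment: it assumes 1-D Gaussian data, a hand-rolled k-means, and a hypothetical stopping rule `pick_k` with a made-up `min_gain` threshold.

```python
import random

def kmeans_1d_wcss(values, k, iters=50):
    """Lloyd's algorithm in 1-D with quantile initialisation; returns the WCSS."""
    xs = sorted(values)
    n = len(xs)
    centers = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for x in xs:
            # Index of the nearest centre.
            j = min(range(k), key=lambda c: (x - centers[c]) ** 2)
            sums[j] += x
            counts[j] += 1
        new_centers = [sums[j] / counts[j] if counts[j] else centers[j]
                       for j in range(k)]
        if new_centers == centers:
            break
        centers = new_centers
    # Within-cluster sum of squares under the final centres.
    return sum(min((x - c) ** 2 for c in centers) for x in xs)

def pick_k(wcss, min_gain):
    """Smallest k whose next split cuts WCSS by less than `min_gain` (a fraction).

    `wcss[i]` holds the WCSS for i + 1 clusters.
    """
    for k in range(1, len(wcss)):
        gain = 1.0 - wcss[k] / wcss[k - 1]
        if gain < min_gain:
            return k
    return len(wcss)

rng = random.Random(0)
# A single homogeneous population: there is no "true" k greater than 1 here.
values = [rng.gauss(0.0, 1.0) for _ in range(1000)]
wcss = [kmeans_1d_wcss(values, k) for k in range(1, 7)]

for threshold in (0.55, 0.30):
    print("min_gain =", threshold, "-> chosen k =", pick_k(wcss, threshold))
```

WCSS keeps dropping as k grows, so some cutoff is needed to stop, and the chosen k moves with that cutoff. The number of clusters has been swapped for `min_gain`, which is no easier to justify.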

I think the fundamental issue is that there isn't a clean mathematical way of expressing that some level of heterogeneity is biologically uninteresting in order to stop the subclustering. I might stop if my subclusters are all related to cell cycle, or metabolic activity, or some other boring thing, but others might get very excited by those same partitions, so who am I to say whether they should use those subclusters? A true "hard limit" of overclustering is when you start dropping below technical variation (e.g., the Poisson noise from sequencing), at which point you can confidently say that you've jumped the shark. But it takes a lot, like a lot, of overclustering to get to that point, so it's mostly a useless threshold.
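For the Poisson "hard limit", one rough way to picture it is the variance-to-mean ratio (Fano factor) of a gene's counts within a cluster, which is about 1 when only Poisson sampling noise is left. The sketch below is a toy under strong assumptions: a single gene, identical library sizes, purely Poisson technical noise, and made-up means. Real data would need a proper noise model.

```python
import math
import random
from statistics import mean, variance

def rpois(lam, rng):
    """Knuth's Poisson sampler; adequate for small means like the ones used here."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def fano(counts):
    """Variance-to-mean ratio; close to 1 for pure Poisson noise."""
    return variance(counts) / mean(counts)

rng = random.Random(1)

# Cluster with nothing left to find: every cell has the same true expression.
pure = [rpois(5.0, rng) for _ in range(2000)]

# Cluster hiding two subpopulations with different true expression (2 vs 8).
mixed = ([rpois(2.0, rng) for _ in range(1000)]
         + [rpois(8.0, rng) for _ in range(1000)])

fano_pure = fano(pure)
fano_mixed = fano(mixed)
print("Fano, pure Poisson cluster:", round(fano_pure, 2))
print("Fano, hidden subpopulations:", round(fano_mixed, 2))
```

Both clusters have a mean of about 5, but only the first has run out of structure. In line with the point above, a cluster usually has to be split many times before every gene looks like the first case.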

In practice, people will always overcluster to see if there's anything interesting as they keep digging. Which is fine, it's all exploratory anyway, no one's really making quantitative claims here. Nonetheless, if you want to implement this method, I'd suggest making your own package; it seems pretty involved and I don't want to be on the hook to maintain it.

from bluster.

DarioS commented on August 31, 2024

Interesting perspective.

