Comments (7)

LTLA commented on July 26, 2024

Thanks @ntamas for the heads-up. Yes, I also noticed this a few months ago when working directly with a recent version of the igraph C library. My solution at the time was to call the relevant seed setter at the C level.

However, it's not clear how I would do so via the R interface. I don't see an option to pass the seed in cluster_louvain on the latest commit of the R package. Maybe it is, as you say, as obvious as calling set.seed(), but this raises other issues, i.e., how do I guarantee that the seed value passed to the C interface is the same as the previous value?

For the purposes of this package, a different seed value is not too problematic, and we can easily set any seed to get the unit tests to pass. However, I also manage a number of Bioconductor workflows that will break if the clustering output changes, e.g., the OSCA book. Fixing these will be more problematic as I will need to manually go through all the examples/demonstrations and ensure that the text is referring to the right cluster number, otherwise my descriptions of the results will look rather silly.

Would it be too late to ask for a seed = argument to cluster_louvain()? This would still allow randomization by default but then I could pass seed = 42 to get the same results as before. I could also insulate my users from changes in the underlying package, at least until the next Bioconductor release rolls around (which gives me an excuse to have a change in behavior).

from bluster.

ntamas commented on July 26, 2024

Summon @vtraag :) Vincent made the improvements to the Louvain clustering algorithm and I'd appreciate his input on ways to make things deterministic if possible, and/or restore the old behaviour by some artificial means. As far as I know, set.seed() should be enough if the underlying implementation does not change, but if the underlying C code is reorganized so that the generated numbers from the RNG are used differently, then all bets are off. The only way to guarantee reproducibility for randomized algorithms is not only to set the seed but also to pin down the concrete implementation of the algorithm.

szhorvat commented on July 26, 2024

> Would it be too late to ask for a seed = argument to cluster_louvain()? This would still allow randomization by default but then I could pass seed = 42 to get the same results as before.

You can do this right now by just using set.seed(). All igraph functions in R use R's own random number generator.

> but this raises other issues, i.e., how do I guarantee that the seed value passed to the C interface is the same as the previous value?

If you are asking whether it is possible to get the same result as in previous versions, the answer is no: the current implementation cannot produce that. Previously, vertices were iterated over in vertex ID order. Now they are iterated over in a random order, which in practice will never be the same as before, for any seed.
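To make the point concrete, here is a minimal sketch in Python (not igraph code; both functions are hypothetical stand-ins for the old and new visiting strategies). The same seed reproduces the new behaviour exactly, but no seed turns a shuffled order back into vertex ID order in practice.

```python
import random

def visit_order_old(n, seed):
    """Hypothetical 'old' strategy: vertex ID order; the seed is irrelevant."""
    return list(range(n))

def visit_order_new(n, seed):
    """Hypothetical 'new' strategy: a seeded random permutation of the vertices."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order

# Same seed, same implementation: identical order on every call.
new_a = visit_order_new(20, seed=42)
new_b = visit_order_new(20, seed=42)

# Same seed, different implementation: the order differs.
old = visit_order_old(20, seed=42)
```

With 20 vertices there are 20! possible orders, so hitting the ID order by choosing a seed is not a realistic option.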

> However, I also manage a number of Bioconductor workflows that will break if the clustering output changes, e.g., the OSCA book. Fixing these will be more problematic as I will need to manually go through all the examples/demonstrations and ensure that the text is referring to the right cluster number, otherwise my descriptions of the results will look rather silly.

This indeed sounds painful. The problem is that the tradeoff is the following: either we don't guarantee the same output for the same seed across versions, or we won't be able to make any improvements to the function in the future. For stochastic functions, even bugfixes can change the output for a given seed.

I have seen people from non-quantitative fields use igraph rather naïvely and try to read too much into the output of community detection functions, assuming the output from a single run of a single stochastic community detection method to be generally trustworthy, without additional checks. Perhaps it's best to write tutorials in a way that avoids creating this misconception. If the output is robust across runs, there should be ways to canonicalize it. For example, there will be some vertices which are consistently part of the same community across all/most runs, and can thus be used to assign an identity to that community.

Regarding scientific reproducibility, a truly reproducible result is one that does not significantly depend on seeds.

LTLA commented on July 26, 2024

Thanks all for the comments. I suspected as much w.r.t. the changes to the implementation. Oh well. I suppose there is little choice but to slap a seed somewhere in bluster and let the chips fall where they may.

FWIW, the results of the community detection seem to be quite robust to upstream changes in our applications (e.g., gain/loss of a few cells, small changes in the distances due to different genes being involved). The problem lies in the fact that the robustness does not usually extend to the exact same cluster numbers, i.e., cluster 1 in one run is renamed to cluster X in another run. This usually requires some manual intervention to check that the right cluster is being referenced in the text.
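As an illustration of "same clusters, different numbers", a small check like the following can tell whether two runs produced the same grouping up to renaming. This is a sketch in Python with made-up label vectors; `same_partition` is a hypothetical helper, not part of bluster or igraph.

```python
def same_partition(labels_a, labels_b):
    """Return True if two labelings group the items identically, ignoring label names."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other run.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

run1 = [1, 1, 2, 2, 3]
run2 = [3, 3, 1, 1, 2]  # same groups, different cluster numbers
run3 = [1, 2, 2, 2, 3]  # genuinely different grouping

renamed_only = same_partition(run1, run2)
really_changed = not same_partition(run1, run3)
```

This only detects exact agreement; runs that differ by a few cells would need a similarity measure instead.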

(You might say that the text shouldn't make any mention of the cluster numbers. Previous versions of the workflows did attempt to follow that principle, but it greatly reduced the usefulness of the workflow as a teaching aid. Turns out to be quite difficult to demonstrate how to draw scientific conclusions from results when I can't actually refer to specific results.)

szhorvat commented on July 26, 2024

Would it perhaps help, for the sake of the book, to sort clusters by size so that the first one is the largest, then the second largest, and so on? Is the order stable w.r.t. the seed in the example you use?
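A minimal sketch of that idea, in Python with made-up labels (`relabel_by_size` is a hypothetical helper, and breaking ties by first appearance is an arbitrary choice made here):

```python
from collections import Counter

def relabel_by_size(labels):
    """Renumber clusters 1, 2, ... from largest to smallest.

    Ties in size are broken by first appearance in `labels`.
    """
    sizes = Counter(labels)
    first_seen = {}
    for i, lab in enumerate(labels):
        first_seen.setdefault(lab, i)
    ranked = sorted(sizes, key=lambda lab: (-sizes[lab], first_seen[lab]))
    new_id = {lab: rank for rank, lab in enumerate(ranked, start=1)}
    return [new_id[lab] for lab in labels]

run1 = ["a", "b", "b", "b", "c", "c"]
run2 = ["x", "y", "y", "y", "z", "z"]  # same grouping, different names

canon1 = relabel_by_size(run1)
canon2 = relabel_by_size(run2)
```

The numbering is only stable when the cluster sizes are clearly separated; two clusters of nearly equal size can still swap places between runs.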

LTLA commented on July 26, 2024

It might help. Not a guaranteed fix, but it might add some robustness. I know some other applications in the field do just that, so I suppose someone finds it useful.

I was thinking about the situation throughout the week and I remembered a few things about the book. First, the book actually contains its own validation code to check that the clusters in the text refer to the correct biological cell types. This is done by checking whether known genes that define a particular cell type are indeed active in the appropriate cluster number. If these checks don't pass, the book compilation throws an error to prompt me to fix it. In theory, I could flip this process such that the book auto-detects the appropriate cluster number to insert into the text (via inline Rmarkdown) based on each cluster's active genes. In the past I wasn't quite brave enough to allow the book to write itself, but I think this would be the most robust solution.
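Roughly, the flipped process might look like the sketch below. It is written in Python for brevity and is not the book's actual validation code: `cluster_for_markers`, the data layout, and the expression values are all assumptions made for illustration.

```python
def cluster_for_markers(mean_expr, markers):
    """Return the cluster whose average expression over `markers` is highest.

    mean_expr: dict mapping cluster id -> dict of gene -> mean expression.
    markers:   list of gene names assumed to define one cell type.
    """
    def score(cluster):
        return sum(mean_expr[cluster].get(g, 0.0) for g in markers) / len(markers)
    return max(mean_expr, key=score)

# Made-up per-cluster mean expression values, purely for illustration.
mean_expr = {
    1: {"CD3E": 0.1, "CD79A": 2.4, "LYZ": 0.2},
    2: {"CD3E": 2.9, "CD79A": 0.1, "LYZ": 0.3},
    3: {"CD3E": 0.2, "CD79A": 0.2, "LYZ": 3.1},
}

# Look up the cluster number instead of hard-coding it in the text.
t_cell_cluster = cluster_for_markers(mean_expr, ["CD3E"])
b_cell_cluster = cluster_for_markers(mean_expr, ["CD79A"])
```

The returned number would then be inserted into the prose inline, so the text follows the clustering rather than the other way around.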

Secondly, I also remembered that the book consistently uses Walktrap as the default algorithm. So hopefully there should only be a few places that are impacted by igraph's Louvain change, which should help matters in the short term.

In any case, the immediate issue should be fixed by 1fde816; we'll just have to see what happens downstream.

szhorvat commented on July 26, 2024

> This is done by checking whether known genes that define a particular cell type are indeed active in the appropriate cluster number. If these checks don't pass, the book compilation throws an error to prompt me to fix it. In theory, I could flip this process such that the book auto-detects the appropriate cluster number to insert into the text (via inline Rmarkdown) based on each cluster's active genes.

Using certain vertices to define a cluster's identity does sound like the right approach to me. However, it also sounds like a lot of work. I just wanted to let you know that while we can't guarantee not to make similar changes in the future, cluster_louvain() specifically is unlikely to change in the near future, as it's not really being worked on. Thus, I wouldn't invest too much time into an automated solution.
