igraph 1.3.0 is about to be released soon on CRAN, and it will change the implementati


Unit tests will break in igraph 1.3.0 about bluster HOT 7 CLOSED

ntamas commented on July 26, 2024

Unit tests will break in igraph 1.3.0

from bluster.

Comments (7)

LTLA commented on July 26, 2024

Thanks @ntamas for the heads-up. Yes, I also noticed this a few months ago when working directly with a recent version of the igraph C library. My solution at the time was to call the relevant seed setter at the C level.

However, it's not clear how I would do so via the R interface. I don't see an option to pass the seed in cluster_louvain on the latest commit of the R package. Maybe it is, as you say, as obvious as calling set.seed(), but this raises other issues, i.e., how do I guarantee that the seed value passed to the C interface is the same as the previous value?

For the purposes of this package, a different seed value is not too problematic, and we can easily set any seed to get the unit tests to pass. However, I also manage a number of Bioconductor workflows that will break if the clustering output changes, e.g., the OSCA book. Fixing these will be more problematic as I will need to manually go through all the examples/demonstrations and ensure that the text is referring to the right cluster number, otherwise my descriptions of the results will look rather silly.

Would it be too late to ask for a seed = argument to cluster_louvain()? This would still allow randomization by default but then I could pass seed = 42 to get the same results as before. I could also insulate my users from changes in the underlying package, at least until the next Bioconductor release rolls around (which gives me an excuse to have a change in behavior).

ntamas commented on July 26, 2024

Summon @vtraag :) Vincent made the improvements to the Louvain clustering algorithm and I'd appreciate his input on ways to make things deterministic if possible, and/or restore the old behaviour by some artificial means. As far as I know, set.seed() should be enough if the underlying implementation does not change, but if the underlying C code is reorganized so that the generated numbers from the RNG are used differently, then all bets are off. The only way to guarantee reproducibility for randomized algorithms is not only to set the seed but also to pin down the concrete implementation of the algorithm.

szhorvat commented on July 26, 2024

> Would it be too late to ask for a seed = argument to cluster_louvain()? This would still allow randomization by default but then I could pass seed = 42 to get the same results as before.

You can do this right now by just using set.seed(). All igraph functions in R use R's own random number generator.
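As a rough sketch of what that could look like on the caller's side: the wrapper below adds a `seed` argument around `cluster_louvain()` by calling `set.seed()` and restoring the caller's RNG state afterwards. `louvain_with_seed` is a hypothetical helper name, not something igraph or bluster provides, and the example graph is only there to make the snippet runnable.

```r
library(igraph)

# Hypothetical helper (not part of igraph): cluster_louvain() with a `seed` argument.
louvain_with_seed <- function(graph, seed = NULL, ...) {
  if (!is.null(seed)) {
    # Save the caller's RNG state, if there is one, and restore it on exit
    # so that seeding here does not leak into the rest of the session.
    if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      old <- get(".Random.seed", envir = globalenv())
      on.exit(assign(".Random.seed", old, envir = globalenv()), add = TRUE)
    } else {
      on.exit(rm(".Random.seed", envir = globalenv()), add = TRUE)
    }
    set.seed(seed)
  }
  cluster_louvain(graph, ...)
}

g <- make_graph("Zachary")  # small built-in example graph
a <- membership(louvain_with_seed(g, seed = 42))
b <- membership(louvain_with_seed(g, seed = 42))
identical(a, b)
```

With `seed = NULL` the call stays randomized. Two calls with the same seed should agree only as long as the installed igraph version stays the same.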

> but this raises other issues, i.e., how do I guarantee that the seed value passed to the C interface is the same as the previous value?

If you are asking whether it is possible to get the same result as in previous versions, the answer is no: the current implementation cannot produce that. Previously, vertices were iterated over in vertex ID order. Now they are iterated over in a random order, which in practice will never be the same as before, for any seed.

> However, I also manage a number of Bioconductor workflows that will break if the clustering output changes, e.g., the OSCA book. Fixing these will be more problematic as I will need to manually go through all the examples/demonstrations and ensure that the text is referring to the right cluster number, otherwise my descriptions of the results will look rather silly.

This indeed sounds painful. The problem is that the tradeoff is the following: either we don't guarantee the same output for the same seed across versions, or we won't be able to make any improvements to the function in the future. For stochastic functions, even bugfixes can change the output for a given seed.

I have seen people from non-quantitative fields use igraph rather naïvely and try to read too much into the output of community detection functions, assuming the output from a single run of a single stochastic community detection method to be generally trustworthy, without additional checks. Perhaps it's best to write tutorials in a way that avoids creating this misconception. If the output is robust across runs, there should be ways to canonicalize it. For example, there will be some vertices which are consistently part of the same community across all/most runs, and can thus be used to assign an identity to that community.

Regarding scientific reproducibility, a truly reproducible result is one that does not significantly depend on seeds.
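One way to check this in practice is to run the clustering under several seeds and measure how much the partitions agree. The snippet below is only a sketch: the example graph, the number of seeds, and the choice of the adjusted Rand index are all assumptions made for illustration.

```r
library(igraph)

g <- make_graph("Zachary")  # small built-in example graph

# Run Louvain under several seeds and keep each membership vector.
seeds <- 1:10
runs <- lapply(seeds, function(s) {
  set.seed(s)
  as.integer(membership(cluster_louvain(g)))
})

# Pairwise agreement between runs; a value of 1 means identical partitions.
# The adjusted Rand index ignores how the clusters happen to be numbered.
pairs <- combn(length(runs), 2)
ari <- apply(pairs, 2, function(p) {
  compare(runs[[p[1]]], runs[[p[2]]], method = "adjusted.rand")
})
summary(ari)
```

If the values are consistently close to 1, the partition does not depend much on the seed, even though the cluster numbers may still differ between runs.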

LTLA commented on July 26, 2024

Thanks all for the comments. I suspected as much w.r.t. the changes to the implementation. Oh well. I suppose there is little choice but to slap a seed somewhere in bluster and let the chips fall where they may.

FWIW, the results of the community detection seem to be quite robust to upstream changes in our applications (e.g., gain/loss of a few cells, small changes in the distances due to different genes being involved). The problem lies in the fact that the robustness does not usually extend to the exact same cluster numbers, i.e., cluster 1 in one run is renamed to cluster X in another run. This usually requires some manual intervention to check that the right cluster is being referenced in the text.

(You might say that the text shouldn't make any mention of the cluster numbers. Previous versions of the workflows did attempt to follow that principle, but it greatly reduced the usefulness of the workflow as a teaching aid. It turns out to be quite difficult to demonstrate how to draw scientific conclusions from results when I can't actually refer to specific results.)

szhorvat commented on July 26, 2024

Would it perhaps help, for the sake of the book, to sort clusters by size so that the first one is the largest, then the second largest, and so on? Is the order stable w.r.t. the seed in the example you use?
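A minimal base R sketch of that relabeling, assuming the clustering is available as a plain membership vector. `relabel_by_size` is a hypothetical helper name, not an igraph or bluster function.

```r
# Relabel clusters so that 1 is the largest, 2 the second largest, and so on.
relabel_by_size <- function(membership) {
  sizes <- table(membership)
  # Order by decreasing size; ties are broken by the original label.
  ord <- order(-as.integer(sizes), names(sizes))
  new_ids <- setNames(seq_along(ord), names(sizes)[ord])
  unname(new_ids[as.character(membership)])
}

m <- c(3, 3, 1, 2, 2, 2, 3, 3)
relabel_by_size(m)
# → 1 1 3 2 2 2 1 1
```

Clusters of equal or nearly equal size can still swap labels between runs, so this adds robustness rather than a guarantee.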

LTLA commented on July 26, 2024

It might help. Not a guaranteed fix, but it might add some robustness. I know some other applications in the field do just that, so I suppose someone finds it useful.

I was thinking about the situation throughout the week and I remembered a few things about the book. First, the book actually contains its own validation code to check that the clusters in the text refer to the correct biological cell types. This is done by checking whether known genes that define a particular cell type are indeed active in the appropriate cluster number. If these checks don't pass, the book compilation throws an error to prompt me to fix it. In theory, I could flip this process such that the book auto-detects the appropriate cluster number to insert into the text (via inline Rmarkdown) based on each cluster's active genes. In the past I wasn't quite brave enough to allow the book to write itself, but I think this would be the most robust solution.
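To make the "flipped" process concrete, here is a toy base R sketch. It is not the book's actual validation code: the data, the variable names, and the rule of picking the cluster with the highest mean marker expression are all made up for illustration.

```r
# Toy data: expression of one marker gene per cell, plus each cell's cluster label.
marker_expr <- c(0.1, 0.2, 5.1, 4.8, 0.0, 5.5, 0.3)
clusters    <- c("1", "1", "2", "2", "3", "2", "3")

# Return the label of the cluster with the highest mean marker expression.
find_cluster <- function(expr, clusters) {
  means <- tapply(expr, clusters, mean)
  names(means)[which.max(means)]
}

target_cluster <- find_cluster(marker_expr, clusters)
target_cluster
# → "2"
```

The text could then refer to the cluster via inline R Markdown, e.g. `r target_cluster`, instead of a hard-coded number.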

Secondly, I also remembered that the book consistently uses Walktrap as the default algorithm. So hopefully there should only be a few places that are impacted by igraph's Louvain change, which should help matters in the short term.

In any case, the immediate issue should be fixed by 1fde816; we'll just have to see what happens downstream.

szhorvat commented on July 26, 2024

> This is done by checking whether known genes that define a particular cell type are indeed active in the appropriate cluster number. If these checks don't pass, the book compilation throws an error to prompt me to fix it. In theory, I could flip this process such that the book auto-detects the appropriate cluster number to insert into the text (via inline Rmarkdown) based on each cluster's active genes.

Using certain vertices to define a cluster's identity does sound like the right approach to me. However, it also sounds like a lot of work. I just wanted to let you know that while we can't guarantee not to make similar changes in the future, cluster_louvain() specifically is unlikely to change in the near future, as it's not really being worked on. Thus, I wouldn't invest too much time into an automated solution.
