m-py / anticlust

Subset partitioning via anticlustering

License: Other

R 76.35% C 23.62% Shell 0.03%

anticlust's Issues

anticlust stops due to large N

Dear Developers,

First of all, thank you for developing such a useful package in R.

I've run into an issue applying anticlustering to a large data set (N = 295k) consisting of 5 variables (2 numeric, 3 categorical).

Numeric variables: age and duration
Categorical variables: gender (2 levels), riskzone (42 levels) and language (4 levels).

library(dplyr)
library(anticlust)

set.seed(98772)
sample_tbl <- 
  sample_tbl %>% 
  mutate(group = anticlustering(sample_tbl[, c("age", "duration")], 
    K = 2, 
    categories = sample_tbl[, c("gender", "riskzone", "language")],
    objective = "kplus",
    standardize = TRUE))

After a few seconds I am running into an error:

Error: segfault from C stack overflow

Do you have any input/strategy on how to overcome this issue?

Thanks in advance

K-means optimization is incorrect for unequal group sizes

Apparently, with the optimized "local-updating" version of k-means anticlustering, the objective is incorrectly updated when the group sizes are unequal. Better results are obtained when recomputing the entire objective during each iteration. Reproducible example:

library(anticlust)

features <- schaper2019[, 3:6]

K <- 3
init <- sample(rep(1:3, nrow(schaper2019) * c(1/4, 1/4, 1/2)))

anticlusters <- anticlustering(
  features,
  K = init,
  objective = variance_objective,
  categories = schaper2019$room
)

mean_sd_tab(features, anticlusters)
# rating_consistent rating_inconsistent syllables     frequency     
# 1 "4.49 (0.24)"     "1.10 (0.07)"       "3.42 (1.10)" "18.33 (2.43)"
# 2 "4.49 (0.25)"     "1.10 (0.07)"       "3.42 (0.72)" "18.29 (2.24)"
# 3 "4.49 (0.25)"     "1.10 (0.06)"       "3.42 (0.94)" "18.31 (2.49)"

anticlusters <- anticlustering(
  features,
  K = init,
  objective = "variance",
  categories = schaper2019$room
)

mean_sd_tab(features, anticlusters)
# rating_consistent rating_inconsistent syllables     frequency     
# 1 "4.46 (0.24)"     "1.11 (0.07)"       "3.79 (1.10)" "19.75 (2.83)"
# 2 "4.51 (0.26)"     "1.11 (0.06)"       "2.96 (0.75)" "17.38 (1.74)"
# 3 "4.50 (0.24)"     "1.10 (0.07)"       "3.46 (0.82)" "18.06 (2.13)"
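For reference, "recomputing the entire objective" can be sketched in base R as follows. This is only an illustration of the k-means (variance) criterion computed from scratch; the function name variance_from_scratch is hypothetical and is not part of anticlust, which ships its own variance_objective().

```r
# Sum of squared Euclidean distances between each element and the
# centroid of its group, recomputed from scratch (no local updating).
# Hypothetical helper for illustration only.
variance_from_scratch <- function(features, groups) {
  features <- as.matrix(features)
  total <- 0
  for (g in unique(groups)) {
    members <- features[groups == g, , drop = FALSE]
    centroid <- colMeans(members)
    total <- total + sum(sweep(members, 2, centroid)^2)
  }
  total
}

x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
equal_sizes <- variance_from_scratch(x, c(1, 1, 1, 2, 2, 2))
unequal_sizes <- variance_from_scratch(x, c(1, 1, 2, 2, 2, 2))
```

Because every centroid is recomputed from its current members, the result does not depend on the group sizes being equal.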

Feature request: fix/constrain cluster assignment in anticlustering()

For my application of anticlust it would be very useful if assignment of individual elements to clusters could be fixed or constrained a priori in anticlustering(). Instead of considering all K clusters for the constrained element, the algorithm would consider only a specific subset of clusters.

My use case is the assignment of versions of a psychological test to school classes during field testing. A small subset of classes have asked to use or not use a specific version; I still want to balance the covariates (averaged student characteristics) between versions across all classes taking these constraints into account.

A list of possible cluster memberships would be a straightforward way of specifying the constraints. Empty (NULL) list elements could denote unconstrained cluster selection. For example, with four elements and three clusters, the following list would specify unconstrained cluster selection for elements 1 and 2, constrain element 3 to cluster 2, and allow only clusters 2 or 3 for element 4:

list(
  NULL,       # unconstrained assignment for element 1
  c(1, 2, 3), # unconstrained assignment for element 2 (since we only have 3 clusters)
  2,          # element 3 fixed to cluster 2
  c(2, 3)     # element 4 constrained to clusters 2 or 3
)

Maybe this is already possible somehow but I was unable to figure out how. Also of course, it may well be that this is not possible to implement for some reason. But I still thought it worthwhile to signal that there is demand for this feature (if only from me…).
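To make the proposal concrete, here is a minimal base R sketch of how such a list could be turned into an N x K matrix of permitted assignments. The function constraints_to_matrix() is hypothetical and only illustrates the proposed input format; nothing like it is claimed to exist in anticlust.

```r
# Convert a list of allowed clusters per element into a logical
# N x K matrix; NULL entries mean "any cluster is allowed".
# Hypothetical helper for illustration only.
constraints_to_matrix <- function(constraints, K) {
  N <- length(constraints)
  allowed <- matrix(FALSE, nrow = N, ncol = K)
  for (i in seq_len(N)) {
    clusters <- constraints[[i]]
    if (is.null(clusters)) {
      allowed[i, ] <- TRUE
    } else {
      stopifnot(all(clusters %in% seq_len(K)))
      allowed[i, clusters] <- TRUE
    }
  }
  allowed
}

constraints <- list(NULL, c(1, 2, 3), 2, c(2, 3))
allowed <- constraints_to_matrix(constraints, K = 3)
```

An exchange-type algorithm could then skip any swap that would move element i into a cluster k with allowed[i, k] equal to FALSE.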

Finally, thank you for the anticlust package!

Replace current preclustering functions

@unDocUMeantIt provided an algorithm for anticlustering that is based on efficiently finding preclusters. From his function centroid_anticlustering(), read out these preclusters and use them as a backend in the balanced_clustering() function when method = "heuristic". My tests indicate that this clustering heuristic is faster and better than any that are currently implemented. This function should also be called when preclustering = TRUE in the anticlustering() function.

This means that I will be able to remove the following functions from the code base: equal_sized_kmeans(), greedy_balanced_k_clustering(), greedy_matching() and any lower level functions that are only called from within these functions.

Preclustering is broken when objective = "kplus"

Since the new variables are appended very early to the input data when objective = "kplus" in anticlustering(), preclustering (i.e., matching()) also uses these variables, which does not make sense and should be fixed.

Remove the argument `standardize`

There is no reason for features to be standardized within the anticlustering() function; users could do it with a call to scale() before calling anticlustering(). I am not even sure if standardization makes much sense in the context of anticlustering (or at least I have yet to see any advantages).

Help wanted: Data sets!

Hello anticlust users!

It would be really helpful if the anticlust package included additional data sets to illustrate the application of anticlustering across diverse settings. If you have used anticlustering and are willing to share your data set openly, please contact me (in this issue or via email). If you have any questions, do not hesitate to ask! You would be mentioned as a contributor of the package.

I am particularly interested in data sets that meet one or more of the following criteria:

Was used in a scientific publication. Unfortunately, even in the days of open data and code, publications rarely seem to share the results of anticlustering [code/data] in open repositories.
Has many variables.
Has more than one categorical variable.
Is a large data set (N > 500, maybe)

If you are interested in sharing your data set, I would also be interested in the code you used for anticlustering (so I know which anticlustering algorithm / objective was used, etc.).

Accommodating NA values conditional on a categorical variable

Thanks for a great package - it has been fantastic for balancing stimuli sets in complex experiments. I was wondering if there's any way to include a variable with NA values conditional on a categorical variable. At the moment NA values are not permitted (understandable). Something like:

library(tidyverse)
library(anticlust)
df <- mtcars |> 
  mutate(hp = ifelse(vs == 0, NA, hp)) |> 
  select(mpg, disp, hp, vs)

anticlustering(
  df[,1:3],
  K = c(9, 9, 9, 5),
  objective = "variance",
  categories = df$vs
)

Right now, my best idea is to do the clustering separately for each category (in this case, each level of vs) and then combine the data into the final groups, but I was wondering if there are any other ways.

Cheers -

Using `standardize = TRUE` in `anticlustering()` can produce NAs

In anticlustering(), using standardize = TRUE (which simply leads to a call of scale()) can produce NAs in the data input. This can apparently happen for binary attributes. The anticlustering function currently does not deal with NAs, and it should return an informative error message suggesting that NAs were produced and that the user should use standardize = FALSE. Currently, an uninformative / misleading error is produced that does not inform users of the actual cause of the problem (using standardization).

library(anticlust)
library(palmerpenguins)

df <- na.omit(penguins)

# no male Gentoo penguins
df <- df[df$species != "Gentoo" | df$sex != "male", ]

binary_categories <- 
  categories_to_binary(df[, c("species", "sex")], use_combinations = TRUE)

groups <- anticlustering(
  binary_categories,
  K = 3,
  objective = "variance",
  repetitions = 10,
  standardize = TRUE
)
#> Error in validate_data_matrix(x) : 
#>   Your data contains `NA`. I cannot proceed because I cannot estimate similarity for data that has missing values. Sorry!

Also:

groups <- anticlustering(
  binary_categories,
  K = 3,
  objective = "variance",
  standardize = TRUE
)
#> Error in c_anticlustering(x, K, categories, objective, local_maximum = local_maximum,  : 
#>   NA/NaN/Inf in foreign function call (arg 1)

Report and example by @einGlasRotwein.
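The underlying mechanism can be reproduced in base R without anticlust: scale() divides each column by its standard deviation, so a constant column (such as a category combination that never occurs) yields 0 / 0 and becomes NaN. The helper constant_columns() below is hypothetical and only sketches the kind of check that could back a more informative error message.

```r
# Column b is constant, so its standard deviation is zero.
x <- cbind(a = c(0, 1, 0, 1), b = c(0, 0, 0, 0))

scaled <- scale(x)
has_missing <- anyNA(scaled)  # column b turns into NaN after scaling

# Hypothetical check: flag columns that contain a single unique value
constant_columns <- function(x) {
  apply(as.matrix(x), 2, function(col) length(unique(col)) == 1)
}

flagged <- constant_columns(x)
```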

Insufficient documentation in new version 0.8.6

Yesterday, I somewhat hastily released version 0.8.6 to get the code out and to relieve some mental load due to too many Git branches. I think it has all the code that it should have, but the documentation is slightly lacking, for example:

The "Details" of ?anticlustering do not explain the new argument cannot_link (at least I included an example). It should also be explained how the graph coloring ILP solver is selected (because it cannot be chosen by the user in this interface).
anticlustering() does not at all refer to the new objective = "average-diversity" (it is only referred to in the change log in NEWS.md)
The new arguments in bicriterion_anticlustering() also require more explanation in the documentation.
The DESCRIPTION is no longer technically correct regarding the system requirements of GLPK or Symphony, because anticlust now depends on lpSolve, which does not have any system requirements, because it includes the C source code of the lpsolve library. This should be reflected in the DESCRIPTION.

I will extend this list when I become aware of additional omissions.

One reason for the lack of documentation is that I am currently working on a preprint that explains a lot of the changes in version 0.8.6 in more detail and on a "theoretical" level (especially regarding cannot-link constraints, but also the average diversity objective). However, I wanted to get the code out »now«, and I am not sure if / when I can finish the paper during summer. I should also insert a new vignette on anticlustering with cannot-link constraints then (because there are several ways to do it).

Remove argument `parallelize` from `anticlustering()`

As the exchange method is now the default algorithm (and is very strongly recommended in comparison to random sampling), it seems a bit too much to include a parallelize option for random sampling, so remove it. This also means removing the argument seed, which is only relevant for making parallel random sampling reproducible. Removing two arguments from anticlustering() is good because it has too many right now, and this change will clean up the code base in general.

BILS heuristic sometimes discards optimal partition from Pareto set

The BILS heuristic sometimes does not return a partition that has an optimal value of the dispersion, even if it is initialized with a partition that has the optimal value (which contradicts the logic of the Pareto set, which must contain a partition if it has the best value on one criterion).

Reproducible example:

data <- structure(c(2L, 2L, 3L, 5L, 1L, 3L, 3L, 2L, 5L, 1L, 4L, 4L, 1L, 
                    3L, 4L, 4L, 1L, 5L, 3L, 4L, 2L, 3L, 2L, 3L, 3L, 1L, 5L, 4L, 4L, 
                    5L, 3L, 2L, 4L, 5L, 2L, 3L, 3L, 1L, 3L, 2L, 3L, 3L, 1L, 2L, 2L, 
                    2L, 4L, 1L, 5L, 5L, 3L, 3L, 5L, 1L, 4L, 2L, 5L, 4L, 5L, 1L, 2L, 
                    3L, 1L, 1L, 3L, 2L, 4L, 5L, 3L, 4L, 5L, 3L, 1L, 5L, 2L, 4L, 2L, 
                    1L, 5L, 2L, 5L, 1L, 1L, 2L, 4L, 2L, 1L, 1L, 1L, 4L, 1L, 3L, 2L, 
                    1L, 1L, 5L, 5L, 4L, 4L, 4L, 5L, 4L, 1L, 3L, 5L, 4L, 2L, 1L, 4L, 
                    1L, 3L, 1L, 3L, 3L, 2L, 3L, 4L, 2L, 1L, 5L, 3L, 4L, 5L, 5L, 4L, 
                    1L, 1L, 4L, 3L, 5L, 2L, 1L, 4L, 4L, 4L, 3L, 2L, 2L, 3L, 5L, 4L, 
                    3L, 3L, 1L, 5L, 5L, 1L, 1L, 1L, 5L, 5L, 4L, 2L, 2L, 4L, 2L, 1L, 
                    3L, 5L, 3L, 1L, 2L, 4L, 4L, 1L, 5L, 1L, 4L, 3L, 4L, 5L, 5L, 4L, 
                    3L, 3L, 2L, 5L, 5L, 1L, 3L, 2L, 3L, 4L, 2L, 5L, 3L, 3L, 2L, 2L, 
                    4L, 2L, 1L, 4L, 1L, 5L, 2L, 5L, 2L, 2L, 3L, 2L, 3L, 3L, 1L, 1L, 
                    5L, 1L, 5L, 1L, 2L, 1L, 3L, 3L, 4L, 2L, 4L, 3L, 1L, 3L, 4L, 2L, 
                    5L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 4L, 5L, 2L, 3L, 1L, 5L, 3L, 2L, 
                    1L, 4L, 4L, 3L, 1L, 2L, 3L, 1L, 1L, 2L, 2L, 4L, 3L, 2L, 2L, 5L, 
                    1L, 3L, 2L, 2L, 4L, 4L, 4L, 5L, 5L, 4L, 4L, 2L, 5L, 2L, 2L, 4L, 
                    5L, 3L, 3L, 2L, 2L, 1L, 3L, 5L, 3L, 5L, 1L, 2L, 4L, 3L, 5L, 5L, 
                    5L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 3L, 1L, 
                    2L, 3L, 4L, 4L, 3L, 4L, 2L, 3L, 4L, 3L, 4L, 5L, 1L, 5L, 4L, 5L, 
                    1L, 1L, 1L, 2L, 2L, 4L, 1L, 2L, 1L, 3L, 3L, 1L, 4L, 3L, 5L, 2L, 
                    4L, 2L, 2L, 1L, 1L, 3L, 5L, 5L, 1L, 4L, 2L, 3L, 3L, 2L, 5L, 4L, 
                    1L, 4L, 3L, 5L, 5L, 4L, 5L, 1L, 5L, 4L, 5L, 5L, 5L, 3L, 4L, 5L, 
                    5L, 4L, 4L, 3L, 3L, 4L, 1L, 4L, 2L, 2L, 4L, 1L, 1L, 2L, 4L, 5L, 
                    3L, 1L, 3L, 3L, 2L, 4L, 1L, 3L, 5L, 5L, 5L, 2L, 5L, 5L, 1L, 5L, 
                    1L, 2L, 1L, 1L, 2L, 4L, 5L, 2L, 2L, 2L, 4L, 5L, 2L, 3L, 1L, 4L, 
                    3L, 3L, 3L, 2L, 4L, 4L, 2L, 3L, 1L, 4L, 1L, 1L, 4L, 3L, 5L, 2L, 
                    5L, 2L, 4L, 2L, 2L, 4L, 4L, 1L, 3L, 1L, 3L, 3L, 3L, 5L, 2L, 1L, 
                    5L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 2L, 5L, 5L, 2L, 5L, 2L, 3L, 1L, 
                    3L, 3L, 5L, 5L, 2L, 4L, 3L, 5L, 1L, 1L, 5L, 3L, 2L, 5L, 4L, 1L, 
                    5L, 5L, 1L, 1L, 5L, 4L, 5L, 4L, 5L, 5L, 1L, 2L, 5L, 1L, 5L, 4L, 
                    3L, 4L, 3L, 1L, 1L, 1L, 5L, 1L, 4L, 5L, 2L, 1L, 4L, 5L, 3L, 1L, 
                    4L, 4L, 1L, 1L, 3L, 4L, 5L, 1L, 1L, 5L, 3L, 4L, 3L, 2L, 2L, 4L, 
                    3L, 2L, 4L, 4L, 5L, 5L, 1L, 5L, 3L, 2L, 1L, 1L, 3L, 2L, 2L, 3L, 
                    5L, 5L, 5L, 4L, 1L, 2L, 4L, 5L, 2L, 4L, 1L, 5L, 4L, 5L, 2L, 5L, 
                    4L, 1L, 2L, 2L, 2L, 5L, 5L, 3L, 2L, 2L, 3L, 3L, 3L, 4L, 1L, 5L, 
                    2L, 1L, 1L, 1L, 5L, 1L, 2L, 4L, 2L, 5L, 2L, 2L, 5L, 4L, 3L, 5L, 
                    3L, 4L, 1L, 4L, 2L, 1L, 5L, 3L, 4L, 4L, 1L), dim = c(120L, 5L
                    ))
# optimal_dispersion(data, K = K)$dispersion # 2.236068
opt_groups <- c(1, 1, 4, 4, 5, 3, 2, 2, 1, 2, 4, 4, 5, 2, 1, 3, 2, 3, 3, 3, 
  1, 2, 1, 1, 1, 1, 3, 3, 2, 4, 1, 4, 2, 1, 2, 3, 1, 4, 1, 4, 2, 
  4, 3, 2, 3, 4, 5, 1, 5, 4, 1, 3, 3, 2, 5, 2, 1, 2, 5, 3, 5, 4, 
  5, 3, 5, 5, 2, 2, 5, 5, 1, 5, 2, 2, 4, 4, 3, 4, 3, 4, 1, 1, 2, 
  3, 5, 1, 5, 5, 2, 3, 4, 5, 1, 2, 2, 5, 4, 5, 4, 3, 5, 4, 4, 3, 
  3, 2, 3, 1, 1, 1, 2, 3, 5, 3, 5, 4, 4, 5, 4, 5)

set.seed(12345)
bils_groups <- bicriterion_anticlustering(data, K = opt_groups, R = c(1, 0))
dispersion_objective(data, opt_groups)
# [1] 2.236068
apply(bils_groups, 1, FUN = function(x) dispersion_objective(data, x))
#        1        2        3        5        6 
# 1.414214 1.414214 1.414214 1.414214 1.732051
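The expectation stated above can be illustrated with a small base R sketch of a Pareto filter for two maximized criteria. The function pareto_rows() is hypothetical, and the objective values are invented toy numbers, not output of the BILS heuristic.

```r
# Each row holds the two objective values of one partition; both
# criteria are maximized. A row is dropped if another row is at least
# as good on both criteria and strictly better on one.
pareto_rows <- function(objectives) {
  n <- nrow(objectives)
  keep <- rep(TRUE, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      at_least_as_good <- all(objectives[j, ] >= objectives[i, ])
      strictly_better <- any(objectives[j, ] > objectives[i, ])
      if (at_least_as_good && strictly_better) {
        keep[i] <- FALSE
        break
      }
    }
  }
  which(keep)
}

# Invented toy values for illustration
objectives <- rbind(
  c(diversity = 100, dispersion = 1.41),
  c(diversity = 90, dispersion = 2.24),
  c(diversity = 95, dispersion = 1.73),
  c(diversity = 80, dispersion = 1.00)
)
front <- pareto_rows(objectives)
```

Whatever the input, the best dispersion among the retained rows equals the best dispersion overall, which is the property the reproducible example shows to be violated.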

Speed-optimize exchange method for objective = "distance"

Now that the exchange method is the default option for anticlustering, it is desirable that the distance objective is computed faster. Instead of recomputing all distances by cluster, do something like the following:

Store the distance matrix and use indexing to read the relevant distance after each swap
To read the relevant distances, store a boolean matrix where the entry [i,j] is TRUE whenever the elements i and j are part of the same cluster. After a swap, swap the columns and rows for the elements i and j (because they just exchange their cluster partners), but also set the entries [i, j] and [j, i] to FALSE (exchange partners are not part of the same cluster).
To compute the objective, use the boolean matrix (with a restriction to the upper or lower triangular part) on the distance matrix and call sum.
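The three steps above could look roughly like this in base R. It is a sketch under the assumption of a plain Euclidean distance matrix and two equal-sized clusters; the names distance_objective() and swap_elements() are hypothetical and not taken from the package.

```r
set.seed(1)
N <- 8
features <- matrix(rnorm(N * 2), ncol = 2)
distances <- as.matrix(dist(features))
clusters <- rep(1:2, each = N / 2)

# same_cluster[i, j] is TRUE when elements i and j share a cluster
same_cluster <- outer(clusters, clusters, "==")

# Objective: sum of within-cluster distances, counting each pair once
distance_objective <- function(distances, same_cluster) {
  sum(distances[same_cluster & upper.tri(distances)])
}

# Update the boolean matrix after elements i and j exchange clusters
swap_elements <- function(same_cluster, i, j) {
  idx <- seq_len(nrow(same_cluster))
  idx[c(i, j)] <- c(j, i)
  same_cluster <- same_cluster[idx, idx]  # swap rows and columns i and j
  same_cluster[i, j] <- FALSE             # exchange partners are not together
  same_cluster[j, i] <- FALSE
  same_cluster
}

before <- distance_objective(distances, same_cluster)
swapped <- swap_elements(same_cluster, i = 1, j = 5)
after <- distance_objective(distances, swapped)
```

The distance matrix itself is never recomputed; only the boolean index changes between swaps.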

Equal group size

In the following cases, the restriction to equal group sizes is not needed and can be dropped (that is, group sizes may deviate by 1):

unrestricted random sampling
categorical random sampling
(not for preclustered random sampling)
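For the unrestricted case, a random assignment whose group sizes differ by at most 1 can be sketched in base R as below. This only illustrates the relaxed size restriction and is not the package's sampling code.

```r
set.seed(123)
N <- 10
K <- 3

# rep_len() recycles the labels 1..K up to length N, so group sizes
# differ by at most one; sample() then shuffles the assignment.
groups <- sample(rep_len(seq_len(K), N))
sizes <- table(groups)
```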

Using multiple initial partitions in `bicriterion_anticlustering()`

The documentation of bicriterion_anticlustering() states: "If multiple init_partitions are given, ensure that each partition (i.e., each row of init_partitions) has the exact same output of table()."

This is bad, and it should not be up to the user. I can do that in anticlust; I already do it in the internal function add_unassigned_elements():

  # now sort labels by group size (so that each time this function is called, we get the same output of table())
  new_labels <- order(table(init), decreasing = TRUE)
  as.numeric(as.character(factor(init, levels = 1:K, labels = new_labels)))

--> Use this code in bicriterion_anticlustering() if the argument init_partitions is used.
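As a standalone illustration of the idea (relabel each partition so that labels are ordered by group size), here is a base R sketch. The function relabel_by_size() is hypothetical and is not the internal package code; it uses match() on the ordering so that the largest group always receives label 1.

```r
# Relabel a partition so that label 1 is the largest group, label 2
# the second largest, and so on. Hypothetical helper for illustration.
relabel_by_size <- function(partition, K) {
  sizes <- tabulate(partition, nbins = K)
  # rank position of each old label when sorted from largest to smallest
  new_labels <- match(seq_len(K), order(sizes, decreasing = TRUE))
  new_labels[partition]
}

# Two partitions with the same group sizes but different labelings
init_partitions <- rbind(
  c(1, 1, 2, 2, 2, 3),
  c(3, 3, 3, 1, 2, 2)
)
relabeled <- t(apply(init_partitions, 1, relabel_by_size, K = 3))
```

After relabeling, every row produces the same output of table(), provided the rows had the same multiset of group sizes to begin with.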

kplus_anticlustering() does not work correctly with preclustering = TRUE

Internally, an augmented data set is passed to anticlustering(), and preclustering is then conducted on the basis of the "normal" features plus the additional k-plus variables, which does not make sense. Therefore, kplus_anticlustering() needs to perform preclustering itself before calling anticlustering(). Calling anticlustering(..., objective = "kplus", preclustering = TRUE) works correctly, however (but this is reduced in its functionality because it only considers means and variances and not higher order moments).

Argument `preclustering` should accept a preclustering vector

The preclustering argument should accept a preclustering vector as input, not only TRUE/FALSE. If the input is TRUE, the preclustering is computed within the function anticlustering.

If the preclustering argument accepts a clustering vector, this allows more flexibility in combining different methods (i.e., exact matching as preclustering, combined with a random sampling heuristic for anticlustering).

Non-standard evaluation

Maybe, at some point, anticlustering() should also be callable in a way similar to the following:

anticlustering(
  iris,
  numeric_vars = c(Sepal.Length, Sepal.Width),
  categorical_vars = Species,
  K = 3
)

That is, the first argument is a generic data argument that takes the entire data frame that users work with; users then specify only the column names to select numeric and categorical variables. It would probably just require adding the arguments numeric_vars and categorical_vars to anticlustering(), testing if they exist, and then using non-standard evaluation to extract the relevant data from the first argument. This would also be better integrated into a tidyverse workflow. None of this makes sense if the data input is a distance matrix, which still has to be supported.

Currently, we would have to use the following, which may be less appealing to users:

anticlustering(
  iris[, c("Sepal.Length", "Sepal.Width")],
  categories = iris$Species,
  K = 3
)
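The column extraction for the proposed interface could be sketched in base R along the lines of what subset() does internally. The functions nse_columns() and split_vars() are hypothetical; only the argument names numeric_vars and categorical_vars come from the proposal above.

```r
# Evaluate an unevaluated column expression against the column
# positions of `data`, as subset() does for its `select` argument.
nse_columns <- function(data, expr) {
  idx <- as.list(seq_along(data))
  names(idx) <- names(data)
  data[, eval(expr, idx, parent.frame()), drop = FALSE]
}

# Hypothetical front end: capture both arguments without evaluating them
split_vars <- function(data, numeric_vars, categorical_vars) {
  list(
    features = nse_columns(data, substitute(numeric_vars)),
    categories = nse_columns(data, substitute(categorical_vars))
  )
}

res <- split_vars(
  iris,
  numeric_vars = c(Sepal.Length, Sepal.Width),
  categorical_vars = Species
)
```

The two resulting data frames correspond to what is currently passed as the first argument and as categories, respectively.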

Merge generic and specialized exchange methods

Right now I have three functions that implement an exchange algorithm: two specialized functions that are speed-optimized for maximizing the k-means and cluster editing objectives, respectively, and one generic version that can maximize any objective function.

This means there is a lot of redundant code. It would be desirable to merge the three functions.

The difficulty in merging is that each of the three functions needs different data structures that are generated and updated throughout the exchange method. I need to test if it is possible to merge the functions in a reasonable way despite this difficulty.
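For orientation, a generic exchange procedure that accepts an arbitrary objective function can be sketched in base R as follows. This is a simplified illustration, not the package implementation: exchange_sketch() and diversity() are hypothetical names, and the objective is recomputed in full for every candidate swap, which is exactly the cost the specialized versions avoid.

```r
# For each element, try swapping it with every element of another
# cluster and keep the swap that improves the objective the most.
exchange_sketch <- function(data, clusters, objective) {
  best <- objective(data, clusters)
  for (i in seq_along(clusters)) {
    best_partner <- NA
    for (j in which(clusters != clusters[i])) {
      candidate <- clusters
      candidate[c(i, j)] <- clusters[c(j, i)]
      value <- objective(data, candidate)
      if (value > best) {
        best <- value
        best_partner <- j
      }
    }
    if (!is.na(best_partner)) {
      clusters[c(i, best_partner)] <- clusters[c(best_partner, i)]
    }
  }
  clusters
}

# Example objective: sum of within-cluster pairwise distances
diversity <- function(data, clusters) {
  d <- as.matrix(dist(data))
  sum(d[outer(clusters, clusters, "==") & upper.tri(d)])
}

set.seed(42)
data <- matrix(rnorm(24), ncol = 2)
init <- rep(1:3, each = 4)
result <- exchange_sketch(data, init, diversity)
```

Since elements are only ever swapped, the group sizes of the initial assignment are preserved.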

Categorical constraints for ILP method

Add the possibility to include categorical constraints when method = "ilp"

Adding Elements to Existing Groups

I received this question via email and share with permission. It is similar to #46 regarding the inclusion of constraints on the cluster membership of items:

I have been using anticlust to assign subjects to groups and the library has been performing very well for me. One use case that I haven’t found a clean solution to is when I need to increase the sample size after I’ve already assigned some subjects to groups. Is there a way to do that? For example, if I have three groups (A,B,C) of 10 subjects and I find that I need to add 10 more subjects in a second round of experiments, is there a way to run anticlustering() with the previous 10 subjects already assigned to A, B,C and have them considered when I add the second round of ten to each group?

What I currently do is just run anticlustering() on the second group as if it were independent and try to make the final assignments manually. Not terribly hard to do, so it’s not a huge issue for me if there isn’t a way to do so (ie, I wouldn’t make a feature request), but I thought I would ask if there is a method that already exists.

Maximizing dispersion can crash when using default algorithm

This makes the R session crash quite reliably (at least about once in every ten attempts):

library(anticlust)

N <- 100
K <- N/2

cannot_link <- c(1, rep(2:(N-1), each = 2), N)
cannot_link <- matrix(cannot_link, ncol = 2, byrow = TRUE)
cannot_link <- rbind(cannot_link, t(apply(cannot_link, 1, rev)))
mat <- matrix(1, nrow = N, ncol = N)
mat[cannot_link] <- -1
anticlustering(mat, K = K, objective = "dispersion")

I get

 *** caught segfault ***
address (nil), cause 'unknown'