
Comments (7)

Thie1e commented on May 30, 2024

Hi, this happens because by default break_ties = c, so if multiple optimal cutpoints are found, all of them are returned. The metric values in e.g. sum_sens_spec_b correspond to the individual optimal cutpoints, so they become list columns, too.

In your example, there are multiple optimal cutpoints in bootstrap repetition 72:

opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 4 2

...and they both lead to the same in-sample metric value, as expected:

opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.759109 1.759109
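
As an aside, you can locate all repetitions that contain ties by checking the lengths of the list column; this is just a quick check on the returned object, not something cutpointr does for you:

which(lengths(opt_cut_b$boot[[1]]$optimal_cutpoint) > 1)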

You can avoid multiple optimal cutpoints by setting break_ties = mean or break_ties = median, for example:

set.seed(102)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
                       silent = TRUE, allowParallel = T, break_ties = mean)
opt_cut_b %>% select(boot) %>% unnest

# A tibble: 100 x 23
   optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b
              <dbl> <dbl>   <dbl>           <dbl>            <dbl> <dbl>   <dbl>         <dbl>
 1                2 0.945   0.875            1.79             1.68 0.880   0.833         0.917
 2                2 0.959   0.877            1.83             1.68 0.867   0.874         0.971
 3                2 0.937   0.905            1.76             1.75 0.897   0.829         0.861
 4                2 0.927   0.955            1.72             1.87 0.867   0.873         0.854
 5                2 0.965   0.849            1.83             1.62 0.870   0.841         0.969
 6                2 0.909   0.976            1.73             1.88 0.863   0.890         0.865
 7                2 0.960   0.830            1.81             1.54 0.865   0.868         0.949
 8                2 0.908   0.959            1.73             1.82 0.836   0.897         0.9  
 9                2 0.935   0.958            1.75             1.85 0.874   0.856         0.875
10                2 0.891   0.951            1.70             1.75 0.872   0.836         0.829
# … with 90 more rows, and 15 more variables: sensitivity_oob <dbl>, specificity_b <dbl>,
#   specificity_oob <dbl>, kappa_b <dbl>, kappa_oob <dbl>, TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>,
#   TP_oob <dbl>, FP_oob <dbl>, TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>

Multiple optimal cutpoints can of course be found regardless of allowParallel; e.g. this also returns multiple optimal cutpoints with allowParallel = F:

set.seed(222)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 1000,
                       silent = TRUE, allowParallel = F)
opt_cut_b %>% select(boot) %>% unnest

Note, however, that the results with allowParallel = F and allowParallel = T differ even with the same seed, because a different random number generator is used when parallelization is activated: in that case we use the doRNG package to make the parallel foreach loops reproducible.
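
For illustration, a minimal sketch of that mechanism outside of cutpointr (the number of workers and the seed are arbitrary here):

library(doParallel)
library(doRNG)

cl <- makeCluster(2)
registerDoParallel(cl)

registerDoRNG(222)                          # seed the parallel RNG streams
x1 <- foreach(i = 1:4, .combine = c) %dorng% rnorm(1)

registerDoRNG(222)                          # same seed again
x2 <- foreach(i = 1:4, .combine = c) %dorng% rnorm(1)

stopCluster(cl)
identical(as.numeric(x1), as.numeric(x2))   # should be TRUE despite the parallel workers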


xrobin commented on May 30, 2024

I see. I guess the main problem is that this breaks some of the pipelines described in the README file and the vignette.

For instance the example in "Accessing data, roc_curve, and boot":

set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 20)

will quickly turn into something rather useless with only minor and unpredictable changes:

set.seed(222)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 100)
summary(opt_cut$boot[[1]]$optimal_cutpoint)
       Min. 1st Qu. Median Mean 3rd Qu. Max. IQR Valid NA's
  [1,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [2,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [3,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [4,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [5,]    4    4.00    4.0  4.0    4.00    4 0.0     1    0
  [6,]    3    3.00    3.0  3.0    3.00    3 0.0     1    0
[...]

Maybe that could be fixed by changing the default break_ties to something that actually breaks the ties, as you suggest?
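
For what it's worth, unlisting the column first gives summary something it can work with again, if one is willing to treat tied cutpoints as separate observations:

summary(unlist(opt_cut$boot[[1]]$optimal_cutpoint))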


Thie1e commented on May 30, 2024

This behavior is basically a compromise:

If we break ties by default with mean or median, this will sometimes lead to suboptimal cutpoints. That's the main reason why the default is break_ties = c. For example, with the above code and break_ties = mean we get sensitivity + specificity = 1.735 instead of 1.759:

> opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 3
> opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.734818

The drawback is obviously that some columns are not type stable with break_ties = c. I agree that functions like summary don't work well with nested tibbles or list columns out of the box.

By the way, I cannot reproduce the output above. Maybe a difference in package or R versions? For me it looks like this:

summary(opt_cut_b$boot[[1]]$optimal_cutpoint)
[...]
       Length Class  Mode   
[...]
[70,] 1      -none- numeric
[71,] 1      -none- numeric
[72,] 2      -none- numeric
[73,] 1      -none- numeric
[74,] 1      -none- numeric
[75,] 1      -none- numeric
[...]

At least there are no misleading numeric results here.

The question is whether we would rather have type stability of all columns or mathematically optimal cutpoints in all scenarios. After thinking about this, the only easily implementable option for the former that I can imagine would be to change the default to break_ties = median (which still returns the optimal metric value whenever there is an odd number of optimal cutpoints) and to print a message whenever multiple optimal cutpoints are found and the function in break_ties is applied.
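
If an actually observed cutpoint together with the optimal metric value matters more than a central tendency of the ties, passing another summary function such as min to break_ties should also work, since break_ties is applied as a function to the vector of tied cutpoints; a rough sketch under that assumption (opt_cut_min is just an illustrative name):

set.seed(102)
opt_cut_min <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
                         silent = TRUE, break_ties = min)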


xrobin commented on May 30, 2024

I can see the same thing as you after upgrading the following packages:

1: assertthat (0.2.0 -> 0.2.1) [CRAN]
2: cli (1.0.1 -> 1.1.0) [CRAN]
3: colorspace (1.4-0 -> 1.4-1) [CRAN]
4: ggplot2 (3.1.0 -> 3.1.1) [CRAN]
5: glue (1.3.0 -> 1.3.1) [CRAN]
6: gtable (0.2.0 -> 0.3.0) [CRAN]
7: lazyeval (0.2.1 -> 0.2.2) [CRAN]
8: purrr (0.3.0 -> 0.3.2) [CRAN]
9: Rcpp (1.0.0 -> 1.0.1) [CRAN]
10: rlang (0.3.1 -> 0.3.4) [CRAN]
11: stringi (1.3.1 -> 1.4.3) [CRAN]
12: tibble (2.0.1 -> 2.1.1) [CRAN]
13: tidyr (0.8.2 -> 0.8.3) [CRAN]

One of them must have changed something.

I agree that the mean or median isn't a good option; it would give you values that don't actually exist in the data. A warning could be nice, but will only get you so far. So I guess that's just the intended behavior: nothing to worry about, just something a little surprising. Thanks for the help!


Thie1e commented on May 30, 2024

I tried to check which package might have caused the different output of summary but could not figure it out. I don't think it has to do with any of the above packages. If we look at methods(summary) there's no summary.list that might cause a different output (class(opt_cut$boot[[1]]$optimal_cutpoint) is simply list). The output I have shown comes from summary.default.

Maybe you had loaded a different package that has a summary.list function?
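
You could check that with something like the following; the exact output of course depends on what is attached in your session:

methods("summary")            # is a summary.list method listed here?
getAnywhere("summary.list")   # searches attached packages and loaded namespaces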


xrobin commented on May 30, 2024

Hmm... I can see a few packages providing summary.list, but none that would have been loaded (or even installed) here.


Thie1e commented on May 30, 2024

OK, thanks for looking into this again. I checked and it's not the function from plink. Anyway, there's probably not much we could do if a summary.list function was loaded by the user.

