
Comments (7)

Thie1e commented on May 30, 2024

Hi, this happens because by default break_ties = c, so if multiple optimal cutpoints are found, all of them are returned. The metric values in e.g. sum_sens_spec_b correspond to the individual optimal cutpoints, so they become list columns, too.

In your example, there are multiple optimal cutpoints in bootstrap repetition 72:

opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 4 2

...and they both lead to the same in-sample metric value, as expected:

opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.759109 1.759109
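
As an aside, you can locate all repetitions that contain ties by checking the lengths of the list column; this is just a quick check on the returned object, not something cutpointr does for you:

which(lengths(opt_cut_b$boot[[1]]$optimal_cutpoint) > 1)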

You can avoid multiple optimal cutpoints by setting break_ties = mean or break_ties = median, for example:

set.seed(102)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
                       silent = TRUE, allowParallel = T, break_ties = mean)
opt_cut_b %>% select(boot) %>% unnest

# A tibble: 100 x 23
   optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b
              <dbl> <dbl>   <dbl>           <dbl>            <dbl> <dbl>   <dbl>         <dbl>
 1                2 0.945   0.875            1.79             1.68 0.880   0.833         0.917
 2                2 0.959   0.877            1.83             1.68 0.867   0.874         0.971
 3                2 0.937   0.905            1.76             1.75 0.897   0.829         0.861
 4                2 0.927   0.955            1.72             1.87 0.867   0.873         0.854
 5                2 0.965   0.849            1.83             1.62 0.870   0.841         0.969
 6                2 0.909   0.976            1.73             1.88 0.863   0.890         0.865
 7                2 0.960   0.830            1.81             1.54 0.865   0.868         0.949
 8                2 0.908   0.959            1.73             1.82 0.836   0.897         0.9  
 9                2 0.935   0.958            1.75             1.85 0.874   0.856         0.875
10                2 0.891   0.951            1.70             1.75 0.872   0.836         0.829
# … with 90 more rows, and 15 more variables: sensitivity_oob <dbl>, specificity_b <dbl>,
#   specificity_oob <dbl>, kappa_b <dbl>, kappa_oob <dbl>, TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>,
#   TP_oob <dbl>, FP_oob <dbl>, TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>

Multiple optimal cutpoints can of course be found regardless of allowParallel; e.g. this also returns multiple optimal cutpoints with allowParallel = F:

set.seed(222)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 1000,
                       silent = TRUE, allowParallel = F)
opt_cut_b %>% select(boot) %>% unnest

Note, however, that the results with allowParallel = F and allowParallel = T differ even with the same seed, because a different random number generator is used when parallelization is activated: in that case we use the doRNG package to make the parallel foreach loops reproducible.
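
For illustration, a minimal sketch of that mechanism outside of cutpointr (the number of workers and the seed are arbitrary here):

library(doParallel)
library(doRNG)

cl <- makeCluster(2)
registerDoParallel(cl)

registerDoRNG(222)                          # seed the parallel RNG streams
x1 <- foreach(i = 1:4, .combine = c) %dorng% rnorm(1)

registerDoRNG(222)                          # same seed again
x2 <- foreach(i = 1:4, .combine = c) %dorng% rnorm(1)

stopCluster(cl)
identical(as.numeric(x1), as.numeric(x2))   # should be TRUE despite the parallel workers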


xrobin commented on May 30, 2024

I see. I guess the main problem is that this breaks some of the pipelines described in the README file and the vignette.

For instance the example in "Accessing data, roc_curve, and boot":

set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 20)

will quickly turn into something rather useless with only minor and unpredictable changes:

set.seed(222)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 100)
summary(opt_cut$boot[[1]]$optimal_cutpoint)
       Min. 1st Qu. Median Mean 3rd Qu. Max. IQR Valid NA's
  [1,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [2,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [3,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [4,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [5,]    4    4.00    4.0  4.0    4.00    4 0.0     1    0
  [6,]    3    3.00    3.0  3.0    3.00    3 0.0     1    0
[...]

Maybe that could be fixed by changing the default break_ties to something that actually breaks the ties, as you suggest?
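
For what it's worth, unlisting the column first gives summary something it can work with again, if one is willing to treat tied cutpoints as separate observations:

summary(unlist(opt_cut$boot[[1]]$optimal_cutpoint))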


Thie1e commented on May 30, 2024

This behavior is basically a compromise:

If we break ties by default with mean or median, this will sometimes lead to suboptimal cutpoints. That's the main reason why the default is break_ties = c. For example, with the above code and break_ties = mean we get sensitivity + specificity = 1.735 instead of 1.759:

> opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 3
> opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.734818

The drawback is obviously that some columns are not type stable with break_ties = c. I agree that functions like summary don't work well with nested tibbles or list columns out of the box.

By the way, I cannot reproduce the output above. Maybe a difference in package or R versions? For me it looks like this:

summary(opt_cut_b$boot[[1]]$optimal_cutpoint)
[...]
       Length Class  Mode   
[...]
[70,] 1      -none- numeric
[71,] 1      -none- numeric
[72,] 2      -none- numeric
[73,] 1      -none- numeric
[74,] 1      -none- numeric
[75,] 1      -none- numeric
[...]

At least there are no misleading numeric results here.

The question is whether we would rather have type stability of all columns or mathematically optimal cutpoints in all scenarios. After thinking about this, the only easily implementable option for the former that I can imagine would be to change the default to break_ties = median (which still returns the optimal metric value whenever there is an odd number of optimal cutpoints) and to print a message whenever multiple optimal cutpoints are found and the function in break_ties is applied.
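
If an actually observed cutpoint together with the optimal metric value matters more than a central tendency of the ties, passing another summary function such as min to break_ties should also work, since break_ties is applied as a function to the vector of tied cutpoints; a rough sketch under that assumption (opt_cut_min is just an illustrative name):

set.seed(102)
opt_cut_min <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
                         silent = TRUE, break_ties = min)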


xrobin commented on May 30, 2024

I can see the same thing as you after upgrading the following packages:

1: assertthat (0.2.0 -> 0.2.1) [CRAN]
2: cli (1.0.1 -> 1.1.0) [CRAN]
3: colorspace (1.4-0 -> 1.4-1) [CRAN]
4: ggplot2 (3.1.0 -> 3.1.1) [CRAN]
5: glue (1.3.0 -> 1.3.1) [CRAN]
6: gtable (0.2.0 -> 0.3.0) [CRAN]
7: lazyeval (0.2.1 -> 0.2.2) [CRAN]
8: purrr (0.3.0 -> 0.3.2) [CRAN]
9: Rcpp (1.0.0 -> 1.0.1) [CRAN]
10: rlang (0.3.1 -> 0.3.4) [CRAN]
11: stringi (1.3.1 -> 1.4.3) [CRAN]
12: tibble (2.0.1 -> 2.1.1) [CRAN]
13: tidyr (0.8.2 -> 0.8.3) [CRAN]

One of them must have changed something.

I agree that the mean or median isn't a good option; it would give you values that don't actually exist in the data. A warning could be nice, but will only get you so far. So I guess that's just the intended behavior: nothing to worry about, just something a little surprising. Thanks for the help!


Thie1e commented on May 30, 2024

I tried to check which package might have caused the different output of summary but could not figure it out. I don't think it has to do with any of the above packages. If we look at methods(summary) there's no summary.list that might cause a different output (class(opt_cut$boot[[1]]$optimal_cutpoint) is simply list). The output I have shown comes from summary.default.

Maybe you had loaded a different package that has a summary.list function?
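
You could check that with something like the following; the exact output of course depends on what is attached in your session:

methods("summary")            # is a summary.list method listed here?
getAnywhere("summary.list")   # searches attached packages and loaded namespaces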


xrobin commented on May 30, 2024

Hmm... I can see a few packages providing summary.list, but none that would have been loaded (or even installed) here.


Thie1e commented on May 30, 2024

OK, thanks for looking into this again. I checked and it's not the function from plink. Anyway, there's probably not much we could do if a summary.list function was loaded by the user.

