Comments (7)
Hi, this happens because by default break_ties = c, so if multiple optimal cutpoints are found, all of them are returned. The metric values in e.g. sum_sens_spec_b correspond to the individual optimal cutpoints, so they become list columns, too.
In your example, there are multiple optimal cutpoints in bootstrap repetition 72:
opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 4 2
...and they both lead to the same in-sample metric value, as expected:
opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.759109 1.759109
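Conceptually, break_ties is just a function that gets applied to the vector of tied optimal cutpoints. A minimal sketch of the three options on the tied values above (this is only the tie-collapsing step, not cutpointr's internals):

```r
# Sketch of the tie-breaking step: break_ties is a function
# applied to the vector of tied optimal cutpoints.
ties <- c(4, 2)  # two cutpoints with the same metric value

c(ties)       # break_ties = c: keep all tied cutpoints -> 4 2
mean(ties)    # break_ties = mean: average of the ties -> 3
median(ties)  # break_ties = median: middle value -> 3
```

With c, the result has length > 1, which is why the affected columns become list columns.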
You can avoid multiple optimal cutpoints by setting break_ties = mean or break_ties = median, for example:
set.seed(102)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
silent = TRUE, allowParallel = TRUE, break_ties = mean)
opt_cut_b %>% select(boot) %>% unnest
# A tibble: 100 x 23
optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 0.945 0.875 1.79 1.68 0.880 0.833 0.917
2 2 0.959 0.877 1.83 1.68 0.867 0.874 0.971
3 2 0.937 0.905 1.76 1.75 0.897 0.829 0.861
4 2 0.927 0.955 1.72 1.87 0.867 0.873 0.854
5 2 0.965 0.849 1.83 1.62 0.870 0.841 0.969
6 2 0.909 0.976 1.73 1.88 0.863 0.890 0.865
7 2 0.960 0.830 1.81 1.54 0.865 0.868 0.949
8 2 0.908 0.959 1.73 1.82 0.836 0.897 0.9
9 2 0.935 0.958 1.75 1.85 0.874 0.856 0.875
10 2 0.891 0.951 1.70 1.75 0.872 0.836 0.829
# … with 90 more rows, and 15 more variables: sensitivity_oob <dbl>, specificity_b <dbl>,
# specificity_oob <dbl>, kappa_b <dbl>, kappa_oob <dbl>, TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>,
# TP_oob <dbl>, FP_oob <dbl>, TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>
Multiple optimal cutpoints can of course be found regardless of allowParallel, e.g. this returns multiple optimal cutpoints with allowParallel = FALSE:
set.seed(222)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 1000,
silent = TRUE, allowParallel = FALSE)
opt_cut_b %>% select(boot) %>% unnest
The results differ between allowParallel = FALSE and allowParallel = TRUE even with the same seed, because the random number generation is different when parallelization is activated. In that case we use the doRNG package for reproducible parallel foreach loops.
from cutpointr.
I see. I guess the main problem is that this breaks some of the pipelines described in the README file and the vignette.
For instance the example in "Accessing data, roc_curve, and boot":
set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 20)
will quickly turn into something rather useless with only minor and unpredictable changes:
set.seed(222)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 100)
summary(opt_cut$boot[[1]]$optimal_cutpoint)
Min. 1st Qu. Median Mean 3rd Qu. Max. IQR Valid NA's
[1,] 2 2.00 2.0 2.0 2.00 2 0.0 1 0
[2,] 2 2.00 2.0 2.0 2.00 2 0.0 1 0
[3,] 2 2.00 2.0 2.0 2.00 2 0.0 1 0
[4,] 2 2.00 2.0 2.0 2.00 2 0.0 1 0
[5,] 4 4.00 4.0 4.0 4.00 4 0.0 1 0
[6,] 3 3.00 3.0 3.0 3.00 3 0.0 1 0
[...]
Maybe that can be fixed by changing the default break_ties to something that actually breaks them, as you suggest?
This behavior is basically a compromise: if we break ties by default using mean or median, this will sometimes lead to suboptimal cutpoints. That's the main reason why the default is break_ties = c. For example, with the above code and break_ties = mean we get sensitivity + specificity = 1.735 instead of 1.759:
> opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 3
> opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.734818
The drawback is obviously that some columns are not type stable with break_ties = c. I agree that functions like summary don't work well with nested tibbles or list columns out of the box.
By the way, I cannot reproduce the output above. Maybe a difference in package or R versions? For me it looks like this:
summary(opt_cut_b$boot[[1]]$optimal_cutpoint)
[...]
Length Class Mode
[...]
[70,] 1 -none- numeric
[71,] 1 -none- numeric
[72,] 2 -none- numeric
[73,] 1 -none- numeric
[74,] 1 -none- numeric
[75,] 1 -none- numeric
[...]
At least there are no misleading numeric results here.
The question is whether we would rather have type stability of all columns or mathematically optimal cutpoints in all scenarios. After thinking about this, the only easily implementable option for the former that I can imagine would be to change the default to break_ties = median (thus still returning the optimal metric value if there is an odd number of optimal cutpoints) and to print a message if multiple optimal cutpoints are found and the function in break_ties is applied.
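The point about median can be sketched directly: with an odd number of tied cutpoints, median returns one of the observed values (so the optimal metric value is preserved), while with an even number it interpolates, just like mean:

```r
# median of an odd number of tied cutpoints is one of the actual
# candidates, so the optimal metric value is preserved
median(c(2, 3, 4))  # -> 3, an observed cutpoint

# with an even number of ties, median interpolates, like mean,
# and may return a value that is not an observed cutpoint
median(c(2, 4))     # -> 3
mean(c(2, 4))       # -> 3
```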
I can see the same thing as you after upgrading the following packages:
1: assertthat (0.2.0 -> 0.2.1) [CRAN]
2: cli (1.0.1 -> 1.1.0) [CRAN]
3: colorspace (1.4-0 -> 1.4-1) [CRAN]
4: ggplot2 (3.1.0 -> 3.1.1) [CRAN]
5: glue (1.3.0 -> 1.3.1) [CRAN]
6: gtable (0.2.0 -> 0.3.0) [CRAN]
7: lazyeval (0.2.1 -> 0.2.2) [CRAN]
8: purrr (0.3.0 -> 0.3.2) [CRAN]
9: Rcpp (1.0.0 -> 1.0.1) [CRAN]
10: rlang (0.3.1 -> 0.3.4) [CRAN]
11: stringi (1.3.1 -> 1.4.3) [CRAN]
12: tibble (2.0.1 -> 2.1.1) [CRAN]
13: tidyr (0.8.2 -> 0.8.3) [CRAN]
One of them must have changed something.
I agree that mean or median isn't a good option; it would give you values that don't actually exist. A warning could be nice but will only get you so far. So I guess that's just the intended behavior: nothing to worry about, just something weird. Thanks for the help!
I tried to check which package might have caused the different output of summary but could not figure it out. I don't think it has to do with any of the above packages. If we look at methods(summary), there's no summary.list that might cause a different output (class(opt_cut$boot[[1]]$optimal_cutpoint) is simply list). The output I have shown comes from summary.default.
Maybe you had loaded a different package that has a summary.list function?
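That would explain the discrepancy, since summary dispatches via S3: any attached package (or the user's own code) can define a summary.list method that shadows summary.default for plain lists. A hypothetical illustration (summary.list here is made up for the example):

```r
x <- list(1, c(4, 2), 3)  # like a list of optimal cutpoints

summary(x)  # summary.default: a Length/Class/Mode table per element

# a user-defined S3 method shadows the default for class "list"
summary.list <- function(object, ...) lengths(object)
summary(x)  # now dispatches to summary.list -> 1 2 1
```

So differing summary output between sessions can come from S3 method shadowing rather than from the cutpointr object itself.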
Hum... I can see a few packages providing summary.list, but none that would've been loaded (or even installed).
OK, thanks for looking into this again. I checked and it's not the function from plink. Anyway, there's probably not much we could do if a summary.list function was loaded by the user.