thie1e / cutpointr
Optimal cutpoints in R: determining and validating optimal cutpoints in binary classification
Home Page: https://cran.r-project.org/package=cutpointr
Dear Dr. Thiele @Thie1e
We have a quick question about allowable inputs for the parameters of cutpointr(). direction currently only accepts ">=" or "<=" but not ">" or "<". We are trying to match the cutpointr output with output from another tool, and we traced the difference to the ">=" in cutpointr versus the ">" in the other tool. Is there any way we can set ">" in cutpointr()? Any advice is greatly appreciated.
Thanks so much for your help.
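For reference, one possible workaround (not an official cutpointr feature, just a sketch): with direction = ">=", a strict ">" at a value c is equivalent to ">=" at the smallest observed value strictly above c, which can be supplied via oc_manual. The cutoff value here is hypothetical.

```r
library(cutpointr)
data(suicide)

# Hypothetical example: the other tool classifies positive when dsi > 2.
c_strict <- 2
cp_equiv <- min(suicide$dsi[suicide$dsi > c_strict])  # smallest observed value above 2

opt_cut <- cutpointr(suicide, dsi, suicide,
                     direction = ">=", pos_class = "yes", neg_class = "no",
                     method = oc_manual, cutpoint = cp_equiv)
```

This reproduces the strict inequality on the observed data, although new data falling between c_strict and cp_equiv would be classified differently.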
The package looks very exciting; however, when I run the following code, I get this error:
> install.packages('cutpointr')
Warning in install.packages :
package ‘cutpointr’ is not available (for R version 3.4.3)
My machine:
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Any ideas?
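A common workaround when no CRAN binary is available for an older R version is to update R, or to install the development version from GitHub (the repository name is assumed from the package homepage):

```r
# Requires a working build toolchain for source installation.
install.packages("remotes")
remotes::install_github("thie1e/cutpointr")
```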
Hi Christian:
When developing a diagnostic test, it is common to have three regions: e.g. high-confidence negative, ambiguous, and high-confidence positive. Two cutpoints cp1 and cp2 would accomplish this. cutpointr currently only accepts one cutpoint, as in: method = oc_manual, cutpoint = cp. I ended up preparing two data sets: data1 includes only the ambiguous middle region, and data2 excludes those falling in the middle region.
It would be great if cutpointr could support something like:
opt_cut <- cutpointr(data=suicide, x=age, class=suicide,
direction = ">=", pos_class = "yes", neg_class = "no",
method = oc_manual, cutpoint = c(30, 40))
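Until such a feature exists, a minimal base-R sketch of the three-region classification (the two cutpoint values are hypothetical):

```r
library(cutpointr)
data(suicide)

cp1 <- 30; cp2 <- 40  # hypothetical lower and upper cutpoints for age

# right = FALSE makes each break behave like ">=", matching direction = ">="
suicide$region <- cut(suicide$age,
                      breaks = c(-Inf, cp1, cp2, Inf),
                      labels = c("negative", "ambiguous", "positive"),
                      right = FALSE)
table(suicide$region)
```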
Currently, the bootstrap that is run within maximize_boot_metric
and minimize_boot_metric
is non-stratified, which can obviously cause problems if there is a low number of positives or a low number of negatives (#33), so we should have the option to also stratify that bootstrap, not only the outer one.
In the "Metric Functions" section, "cost_misclassification" should be changed to "misclassification_cost".
Hi Christian,
Using the following code, I can get to opt_cut for a custom cutpoint. How do I access the PPV and NPV values from the 100 samplings?
I'd like to plot the PPV or NPV distribution. I figure it has something to do with opt_cut$boot but I am not able to figure it out. Thanks much.
library(cutpointr)
library(dplyr)
data(suicide)
opt_cut <- cutpointr(data = suicide, x = dsi, class = suicide,
pos_class = 'yes', neg_class = 'no',
direction = '>=', boot_stratify = TRUE,
method = oc_manual, cutpoint = 3, boot_runs = 100) %>%
cutpointr::add_metric(list(ppv, npv))
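One way to get at the per-repetition values (a sketch, assuming the bootstrap results contain the out-of-bag confusion-matrix counts TP_oob, FP_oob, TN_oob, and FN_oob, as shown elsewhere in this thread) is to compute PPV and NPV from opt_cut$boot directly:

```r
library(cutpointr)
library(dplyr)
data(suicide)

opt_cut <- cutpointr(data = suicide, x = dsi, class = suicide,
                     pos_class = "yes", neg_class = "no", direction = ">=",
                     method = oc_manual, cutpoint = 3, boot_runs = 100)

# boot is a list column holding one tibble of per-repetition results
boot_res <- opt_cut$boot[[1]] %>%
  mutate(ppv_oob = TP_oob / (TP_oob + FP_oob),   # out-of-bag PPV per repetition
         npv_oob = TN_oob / (TN_oob + FN_oob))   # out-of-bag NPV per repetition

hist(boot_res$ppv_oob)  # distribution of PPV over the 100 bootstrap samples
```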
@Thie1e
I don't know if you are aware of it already, but I just noticed that cutpointr was removed from CRAN a few days ago.
Hi Christian,
I wonder whether oc_manual is the right way to specify a custom cutpoint. I ran the following code to provide my own cutpoint instead of letting the function find the optimal one. I'm not sure if oc_manual is used correctly, but the output ignores my input. Please advise!
data(suicide)
opt_cut <- cutpointr(data = suicide, x = dsi, class = suicide,
pos_class = 'yes', neg_class = 'no',
direction = '>=', boot_stratify = TRUE,
oc_manual = 2.5)
Thanks much @Thie1e
Hi
I am using the manual method to set specific cutpoints
While the sensitivity and specificity change with each manual cutpoint, the AUC does not.
Why isn't the AUC changing with each cutpoint?
(Unless I have done something wrong?!)
thank you
>
> > cp <- cutpointr(data, test, status, method = maximize_metric, metric = sum_sens_spec)
> Assuming the positive class is 1
> Assuming the positive class has higher x values
> > cp
> A tibble: 1 x 16
> direction optimal_cutpoint method sum_sens_spec acc sensitivity specificity AUC pos_class neg_class prevalence
> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 >= 21.3 maximize_metric 1.33777 0.670951 0.664894 0.672881 0.701632 1 0 0.241645
> outcome predictor data roc_curve boot
> <chr> <chr> <list<df[,2]>> <list> <lgl>
> 1 status test [778 × 2] <tibble [157 × 10]> NA
> > opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 21, boot_runs = 30)
> Assuming the positive class is 1
> Assuming the positive class has higher x values
> Running bootstrap...
> >
> > cp
> A tibble: 1 x 16
> direction optimal_cutpoint method sum_sens_spec acc sensitivity specificity AUC pos_class neg_class prevalence
> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 >= 21.3 maximize_metric 1.33777 0.670951 0.664894 0.672881 0.701632 1 0 0.241645
> outcome predictor data roc_curve boot
> <chr> <chr> <list<df[,2]>> <list> <lgl>
> 1 status test [778 × 2] <tibble [157 × 10]> NA
> > opt_cut_manual
> A tibble: 1 x 16
> direction optimal_cutpoint method sum_sens_spec acc sensitivity specificity AUC pos_class neg_class prevalence outcome
> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
> 1 >= 21 oc_manual 1.33655 0.664524 0.675532 0.661017 0.701632 1 0 0.241645 status
> predictor data roc_curve boot
> <chr> <list<df[,2]>> <list<df[,9]>> <list>
> 1 test [778 × 2] [157 × 9] <tibble [30 × 23]>
> > opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 20, boot_runs = 30)
> Assuming the positive class is 1
> Assuming the positive class has higher x values
> Running bootstrap...
> > opt_cut_manual
> A tibble: 1 x 16
> direction optimal_cutpoint method sum_sens_spec acc sensitivity specificity AUC pos_class neg_class prevalence outcome
> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
> 1 >= 20 oc_manual 1.32322 0.651671 0.680851 0.642373 0.701632 1 0 0.241645 status
> predictor data roc_curve boot
> <chr> <list<df[,2]>> <list<df[,9]>> <list>
> 1 test [778 × 2] [157 × 9] <tibble [30 × 23]>
> > opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 30, boot_runs = 30)
> Assuming the positive class is 1
> Assuming the positive class has higher x values
> Running bootstrap...
> > opt_cut_manual
> A tibble: 1 x 16
> direction optimal_cutpoint method sum_sens_spec acc sensitivity specificity AUC pos_class neg_class prevalence outcome
> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
> 1 >= 30 oc_manual 1.32584 0.722365 0.547872 0.777966 0.701632 1 0 0.241645 status
> predictor data roc_curve boot
> <chr> <list<df[,2]>> <list<df[,9]>> <list>
> 1 test [778 × 2] [157 × 9] <tibble [30 × 23]>
> > opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 40, boot_runs = 30)
> Assuming the positive class is 1
> Assuming the positive class has higher x values
> Running bootstrap...
> > opt_cut_manual
> A tibble: 1 x 16
> direction optimal_cutpoint method sum_sens_spec acc sensitivity specificity AUC pos_class neg_class prevalence outcome
> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
> 1 >= 40 oc_manual 1.29017 0.755784 0.430851 0.859322 0.701632 1 0 0.241645 status
> predictor data roc_curve boot
> <chr> <list<df[,2]>> <list<df[,9]>> <list>
> 1 test [778 × 2] [157 × 9] <tibble [30 × 23]>
> > opt_cut_manual <- cutpointr(data, test, status, method = oc_manual, cutpoint = 40)
> Assuming the positive class is 1
> Assuming the positive class has higher x values
> > opt_cut_manual
> A tibble: 1 x 16
> direction optimal_cutpoint method sum_sens_spec acc sensitivity specificity AUC pos_class neg_class prevalence outcome
> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
> 1 >= 40 oc_manual 1.29017 0.755784 0.430851 0.859322 0.701632 1 0 0.241645 status
> predictor data roc_curve boot
> <chr> <list<df[,2]>> <list<df[,9]>> <lgl>
> 1 test [778 × 2] [157 × 9] NA
Hi Christian,
Is there any way to specify the resampling size in cutpointr() instead of using the same sample size as the input data?
Thanks much @Thie1e
When I check cutpointr with the dev version of tidyr, I see:
checking examples ... ERROR
...
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> library(cutpointr)
> cutpointr(suicide, dsi, suicide, gender) %>%
+ add_metric(list(ppv, npv)) %>%
+ select(optimal_cutpoint, subgroup, AUC, sum_sens_spec, ppv, npv)
Assuming the positive class is yes
Assuming the positive class has higher x values
Error in check_roc_curve(optcut) :
roc_curve as returned by the method function is not an object of the class roc_cutpointr
Calls: %>% ... cutpointr_internal -> <Anonymous> -> .f -> check_roc_curve
Execution halted
checking tests ...
ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
══ testthat results ═════════════════════════════════════════════════════════════════
OK: 87 SKIPPED: 0 FAILED: 40
1. Error: Cutpointr returns a cutpointr without NAs and a certain Nr of rows (@test-cutpointr.R#3)
2. Error: Cutpointr works with different data types (@test-cutpointr.R#19)
3. Error: Bootstrap does not return duplicate colnames (@test-cutpointr.R#78)
4. Error: Plotting with bootstrapping is silent (@test-cutpointr.R#94)
5. Error: AUC calculation is correct and works with Inf and -Inf (@test-cutpointr.R#134)
6. Error: Correct midpoints are found (@test-cutpointr.R#149)
7. Error: find_metric_name finds metric (@test-cutpointr.R#160)
8. Error: no duplicate column names are returned (@test-cutpointr.R#182)
9. Error: Correct cutpoints with example data (@test-cutpointr.R#212)
1. ...
Error: testthat unit tests failed
Execution halted
checking re-building of vignette outputs ... WARNING
Error in re-building vignettes:
...
Quitting from lines 46-52 (cutpointr.Rmd)
Error: processing vignette 'cutpointr.Rmd' failed with diagnostics:
roc_curve as returned by the method function is not an object of the class roc_cutpointr
Execution halted
Would you mind looking into this for me? It's possible that I've accidentally changed the tidyr API in some way, but the changes are small and cutpointr is the only CRAN package that shows problems.
Hi,
Is there a way to calculate 95% CIs for PPV and NPV in cutpointr? I have managed to do this for the AUC but not PPV or NPV.
Thanks
CI functionality could be added to plot_roc, see #43.
Hi @Thie1e,
I'm running the example and I'm getting this error; I can't understand what I'm doing wrong... I just followed the example! Thanks!
> cp <- cutpointr(suicide, dsi, suicide, method = maximize_metric, metric = sum_sens_spec)
Assuming the positive class is yes
Assuming the positive class has higher x values
Error in !predictor : invalid argument type
As soon as the patchwork package is on CRAN, plot.cutpointr should use that package instead of gridExtra. That way, all plots that are generated by cutpointr will be ggplot objects and, additionally, the resulting plot can be further modified.
Is it possible to report more than one metric function in the columns of the result of cutpointr? Like PPV, NPV, accuracy?
thank you
Something is wrong with the plot_sensitivity_specificity command. When I run it I get an empty ggplot structure with a vertical line at the number 2.
Is this correct?
library(cutpointr)
data(suicide)
opt_cut <- cutpointr(suicide, dsi, suicide)
plot_sensitivity_specificity(opt_cut)
Thanx
tidyr is preparing for the release of what will become v1.0.0.
Here's what we see for cutpointr in our initial revdep check:
https://github.com/tidyverse/tidyr/blob/master/revdep/problems.md#cutpointr
I took a quick look at your dev version and it seems the tidyr::nest_() calls are gone, which is good. I'm seeing other errors ("Tibble columns must have consistent lengths"), though, which keep me from verifying whether all is well with respect to tidyr.
It looks like you are perhaps building towards a release anyway. So I suggest that, while you are at it, you also make sure you're testing against the devel version of tidyr. Let me know if you need any pointers.
Here is advice on how to test your package on travis against the devel version of tidyr during this period:
Dear @Thie1e,
When using cutpointr, I obtained better results with method = oc_youden_kernel because the two classes are not normally distributed. When using the default method = maximize_metric, is the empirical CDF method used? Thanks much.
The current default is use_midpoints = FALSE
. However, as the Readme and the vignette illustrate, this leads to a certain bias when methods other than the normal or kernel method are applied. On the other hand, many users might prefer the current setting, because the returned cutpoints can be found in the data. Additionally, changing the default might surprise current users. Should it be changed or not?
As suggested in #8, we could plot facets per subgroup instead of distinguishing subgroups by linetype.
I received a question about how to get 95% confidence intervals instead of only the limits at 5% and 95% from the summary output. I'm posting a solution for calculating bootstrap quantiles here for future reference.
library(cutpointr)
library(tidyverse)
cp1 <- cutpointr(suicide$dsi, suicide$suicide, suicide$gender,
boot_runs = 1000, boot_stratify = TRUE,
na.rm=TRUE)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Running bootstrap...
boot_ci(cp1, acc, in_bag = F, alpha = 0.05) %>%
mutate(variable = "acc")
#> # A tibble: 4 x 4
#> subgroup quantile values variable
#> <chr> <dbl> <dbl> <chr>
#> 1 female 0.025 0.779 acc
#> 2 female 0.975 0.931 acc
#> 3 male 0.025 0.630 acc
#> 4 male 0.975 0.926 acc
cp1 %>%
select(subgroup, boot) %>%
unnest(boot) %>%
dplyr::select(-(TP_b:roc_curve_oob)) %>%
pivot_longer(-subgroup) %>%
group_by(subgroup, name) %>%
summarise(lower_ci = quantile(value, 0.025, na.rm = TRUE),
upper_ci = quantile(value, 0.975, na.rm = TRUE))
#> `summarise()` regrouping output by 'subgroup' (override with `.groups` argument)
#> # A tibble: 26 x 4
#> # Groups: subgroup [2]
#> subgroup name lower_ci upper_ci
#> <chr> <chr> <dbl> <dbl>
#> 1 female acc_b 0.798 0.936
#> 2 female acc_oob 0.779 0.931
#> 3 female AUC_b 0.893 0.979
#> 4 female AUC_oob 0.882 0.988
#> 5 female cohens_kappa_b 0.324 0.625
#> 6 female cohens_kappa_oob 0.287 0.605
#> 7 female optimal_cutpoint 1 4
#> 8 female sensitivity_b 0.815 1
#> 9 female sensitivity_oob 0.636 1
#> 10 female specificity_b 0.786 0.937
#> # ... with 16 more rows
Created on 2020-12-22 by the reprex package (v0.3.0)
Dear Christian,
Thanks for the benchmarks that are performed in the vignette. I've been looking into why pROC is significantly slower and was able to identify and fix some of the bottlenecks. I'm planning to release it around the end of the month and will propose a pull request once it is on CRAN, if that's OK with you.
I was wondering if speed was the only reason to exclude pROC with more than 1e5 observations, or if memory was also a factor. In the vignette you write:
OptimalCutpoints and ThresholdROC had to be excluded from benchmarks with
more than 1e4 observations and pROC from benchmarks with more than 1e5
observations due to high memory requirements and/or excessive run times
Do you remember if memory was a criterion for pROC and, if so, what your criterion was exactly, or if it was only a run-time reason? I have been able to run the benchmark with 1e7 data points in pROC without any noticeable memory issue. Here is what it looks like:
PS: in the second plot, you run only the ROCR::prediction function and not ROCR::performance, which is necessary to get the sensitivities and specificities. Is there a reason for that?
Hi,
Thank you so much for cutpointr.
It's more a question than a bug, perhaps. I may have missed something obvious, but when I call cutpointr functions from its namespace (as I would for inclusion in my own package), I get an error. Here's my minimal reproducible example:
library(cutpointr)
#> Warning: package 'cutpointr' was built under R version 3.5.2
opt_cut_namespace <- cutpointr::cutpointr_(
suicide,
"dsi",
"suicide",
method = cutpointr::minimize_boot_metric,
metric = cutpointr::false_omission_rate
)
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Error in `$<-.data.frame`(`*tmp*`, "method", value = c("::", "cutpointr", : replacement has 3 rows, data has 1
sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17134)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
#> [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
#> [5] LC_TIME=English_Canada.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] cutpointr_0.7.4
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.18 knitr_1.20 bindr_0.1.1 magrittr_1.5
#> [5] tidyselect_0.2.4 R6_2.2.2 rlang_0.2.1 foreach_1.4.4
#> [9] stringr_1.3.1 dplyr_0.7.6 tools_3.5.1 iterators_1.0.10
#> [13] htmltools_0.3.6 yaml_2.2.0 rprojroot_1.3-2 digest_0.6.15
#> [17] assertthat_0.2.0 tibble_1.4.2 crayon_1.3.4 bindrcpp_0.2.2
#> [21] tidyr_0.8.1 purrr_0.2.5 codetools_0.2-15 glue_1.3.0
#> [25] evaluate_0.11 rmarkdown_1.10 stringi_1.1.7 compiler_3.5.1
#> [29] pillar_1.3.0 backports_1.1.2 pkgconfig_2.0.1
What's the recommended way, if any, to do this? Thanks again.
Best,
V.
Hi @Thie1e,
I hope you're doing well. I'm trying to run cutpointr inside a loop and, because of rlang problems, I'm totally stuck...
Try 1: it seems that's because i is considered a string
for(i in c("var1", "var2", "var3")){
cutpointr(data = mydata, x = i, class = myclass, pos_class = 1, metric = sum_sens_spec)
}
Error in if (stats::median(x[class != pos_class]) < stats::median(x[class == :
missing value where TRUE/FALSE needed
Try 2: I force i to be the original variable....
for(i in c("var1", "var2", "var3")){
cutpointr(data = mydata, x = eval(parse(text = i)), class = myclass, pos_class = 1, metric = sum_sens_spec)
}
Error: Can't convert a call to a string
Run `rlang::last_error()` to see where the error occurred.
<error/rlang_error>
Can't convert a call to a string
Backtrace:
1. cutpointr::cutpointr(...)
2. cutpointr:::cutpointr.default(...)
3. rlang::as_name(rlang::enquo(x))
4. rlang::as_string(x)
5. rlang:::abort_coercion(x, "a string")
Run `rlang::last_trace()` to see the full context.
How can I solve, or at least bypass, this problem and run it inside a loop, please? Thanks in advance.
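A sketch of one possible workaround, assuming cutpointr's tidy evaluation accepts rlang unquoting (shown here with the built-in suicide data; substitute your own data and variable names):

```r
library(cutpointr)
library(rlang)
data(suicide)

# Build a symbol from the string and unquote it so cutpointr resolves it
# as a column of the data rather than as a literal character value.
for (i in c("dsi", "age")) {
  cp <- cutpointr(data = suicide, x = !!sym(i), class = suicide,
                  metric = sum_sens_spec)
  print(cp$optimal_cutpoint)
}
```

Alternatively, the standard-evaluation variant cutpointr_() takes column names as strings, as in the namespace example elsewhere in this thread.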
Running summary on data from multi_cutpointr throws an error:
multi_cut <- multi_cutpointr(suicide, c("age", "dsi"), "suicide", subgroup="gender")
summary(multi_cut)
[...]
----------------------------------------------------------------------------------------------
Error in Math.data.frame(list(optimal_cutpoint = 56, method = "maximize_metric", :
non-numeric variable(s) in data frame: method
The plot function is also behaving weirdly, and seems to be mixing both x variables in the same plot:
plot(multi_cut)
Similar issues happen with bootstrap enabled (boot_runs > 0).
Hi Christian,
Thanks much for the nice and fast package that makes cutpoint-related calculations enjoyable!
I have a quick question. When using the subgroup option in cutpointr(), for example subgroup = 'gender', opt_cut$boot will be a list of two items. How can I tell which item is for gender = female and which is for gender = male? I guess it is based on alphabetical order, so female is in opt_cut$boot[[1]] and male in opt_cut$boot[[2]]. Is this correct?
Thanks a lot.
Not a big deal but somewhat surprising, I can do:
cutpointr(suicide, dsi, class=suicide)
but not:
multi_cutpointr(suicide, dsi, class=suicide)
Error in multi_cutpointr(suicide, dsi, class = suicide) :
class should be the name of the outcome variable (character)
It would be nice to have NSE for multi_cutpointr too.
Hi, is there a way of calculating the confidence interval of AUC with cutpointr?
Thank you
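For reference, a sketch under the assumption that bootstrapping has been run: the package's boot_ci() helper (used in the bootstrap-quantile example earlier in this thread) can return the bootstrap quantiles of the AUC.

```r
library(cutpointr)
data(suicide)

cp <- cutpointr(suicide, dsi, suicide, boot_runs = 1000)

# Out-of-bag 2.5% and 97.5% quantiles as an approximate 95% CI for the AUC
boot_ci(cp, AUC, in_bag = FALSE, alpha = 0.05)
```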
Hi,
I'm using cutpointr with boot_runs to get cutpoints, AUC, specificity, and sensitivity with their respective confidence intervals. I also "manually" extracted the PPV and NPV, but I don't know how to get their respective confidence intervals.
Does someone have any idea?
Thank you!
Thank you for developing this really great package! I have three suggestions/enhancements for this package.
First, the output of the cutpointr() function contains a column that is named after the metric that the user chooses. So if metric = "youden", the column will be called youden. I suggest changing this column name to metric_value and adding a new column called metric that gives the character string of the metric used (in this case "youden"). This will enable users to run the analysis for multiple metrics, bind results together by row, and use tidyverse functions to wrangle the output.
Second, when adding in subgroups, the cutpointr() output has a new column subgroup and therefore no longer matches the function output from the non-subgroup cutpointr() case. I suggest adding a subgroup column with the character string none (or something similar) in the non-subgroup case so that these two outputs can be bound if the user wants to do this.
Last, this is sort of a question and a suggestion. Can you please change the boot_stratify explanation to say something like "...keeping a proportionate number of positives and negatives, as in the full dataset, when resampling"? As it's written now, it sounds like the number of positives is exactly equal to the number of negatives.
Thanks again!
We could add a function that makes it easier to calculate confidence intervals from bootstrap results, as was desired in #10.
Hello,
I wanted to create a composite biomarker score from 4 markers that I can then enter into further predictive analysis. I was planning to do this by extracting the β coefficients from a multivariable logistic regression with all (standardized) biomarkers and then multiplying those with the (standardized) biomarker levels to create a composite. I just wondered whether this method is OK and how I can go about it in R?
Thanks,
Laura
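A sketch of the approach described above, using simulated data (all variable names are hypothetical): standardize the markers, fit the multivariable logistic regression, and form the composite as the coefficient-weighted sum of the standardized levels.

```r
set.seed(1)
n <- 200
dat <- data.frame(m1 = rnorm(n), m2 = rnorm(n), m3 = rnorm(n), m4 = rnorm(n))
dat$outcome <- rbinom(n, 1, plogis(0.8 * dat$m1 - 0.5 * dat$m3))

markers <- scale(dat[, c("m1", "m2", "m3", "m4")])  # standardized biomarkers
fit <- glm(outcome ~ m1 + m2 + m3 + m4, family = binomial,
           data = data.frame(markers, outcome = dat$outcome))

# Composite score: beta coefficients (without the intercept) times the
# standardized marker levels
dat$score <- as.vector(markers %*% coef(fit)[-1])
```

Note that this linear combination is monotonically related to the predicted probability from predict(fit, type = "response"), so either could be carried forward into a cutpoint analysis.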
If I run multi_cutpointr on the suicide dataset without specifying x, I get the following error on a non-numeric column:
multi_cutpointr(suicide, class = "suicide")
age:
Assuming the positive class is no
Assuming the positive class has higher x values
gender:
Error in median.default(x[class == uc[1]]) : need numeric data
It is common to have non-numeric columns with different groupings etc. that may not be used right now. Just like the gender column. It would be a nice addition to auto-detect numeric columns only.
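In the meantime, a sketch of a user-side workaround is to select the numeric columns before the call:

```r
library(cutpointr)
data(suicide)

num_vars <- names(suicide)[sapply(suicide, is.numeric)]  # here: "age" and "dsi"
multi_cutpointr(suicide, x = num_vars, class = "suicide")
```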
Hi @Thie1e,
I'm here again 😓... I'm facing an error I never saw before:
> cutpointr(data = u, x = x, class = class, na.rm = T,
+ metric = sum_sens_spec, method = minimize_boot_metric,
+ boot_cut = 100, boot_stratify = T, boot_runs = 100)
Assuming the positive class is 1
Assuming the positive class has higher x values
Error in optimize_metric(data = data[b_ind, ], x = x, class = class, metric_func = metric_func, :
All metric values are missing.
In addition: Warning message:
In roc(data = data, x = !!x, class = !!class, pos_class = pos_class, :
ROC curve contains no positives
And this is my data structure:
structure(list(x = c(0.225, 1.936, 0.0315, 0.0078, 0.4698, 19.35,
0.0531, 1.7466176, 0, 0.02350828, 0.0714725, 0.5275296, 7.68378,
0.05376, 0.020688, 0.08143, 1.127828, 0, 0.0313956, NA, 0.04976592,
30.072, 6.492, 2.99, 2.52, 0.17, 0.03321, 0.3306, 0, 0.884, NA,
29.7, 1.4, 0, 0, 0.12320784, 1.108, 0.023104, 66.448512, 4.180792,
0.5792, 0.0444, 0, 0.0392, 0, 0, 0.2105334, 0.225, 3.355, 23.4
), class = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 50L
), class = "data.frame")
I see that you updated the package version, maybe I'm missing something? Thanks for your help (again)!
When running summary() on a cutpointr (or multi_cutpointr) object in an Rmd, the output is split up into multiple output windows below the Rmd chunk, because the summary function prints the elements of the output one by one. The output should look the same as in the console. Maybe also round the bootstrap summary to 2 instead of 3 decimal places to make the output narrower. Also, the ROC plot should say "by subgroup" instead of "by class" when subgroups are given.
Especially with imbalanced data sets that contain a low absolute number of observations of one of the two classes, some bootstrap samples will not contain observations of both classes and the cutpoint optimization cannot be run. There are several ways to deal with that. Currently, cutpointr uses option 1. Another option is returning NA for all results of that bootstrap repetition; we did that in an older version, but it leads to the question of how to deal with the results later, e.g. for plotting. Since many results may be missing, the plots of distributions may be misleading (based on a very low number of repetitions). We issued warnings in that case, but the constant warnings are confusing. My impression is that option 3 leads to worse confidence intervals than option 1. The cutpointr:::simple_boot function supports both schemes. To switch to option 3, the argument in simple_boot needs to be set and the code before it is called needs to be edited (some necessary lines for option 3 are currently commented out). A simulation study (different distributions of predictor values, different metrics) to check the coverage probabilities of confidence intervals from options 1 and 3 would be helpful here.
See #60
I'm having many issues when trying to create cutpointr models through lapply().
This is an example where I find it easy to understand why I think it should work, but it does not.
In this case, when I try to pass a character vector to the argument x of cutpointr() using lapply(), it does not recognize the existence of such an object.
My workaround is to input the predictions and class as vectors instead of in a data.frame, but I would prefer to do it the way I suppose it is intended. Is there something that could be done? Maybe something that has to do with tidy evaluation (eval_tidy())?
Best regards
Hi (again ::sweat::) @Thie1e,
I've a (possibly silly) question, if you don't mind, please. I ran cutpointr(..., metric = sum_sens_spec) and I don't understand how it calculates this cutoff, because when I do it manually my results are a bit different.
Over the ct$roc_curve I have added a new column (ss) with the sens + spec calculation (just to test what I'm saying)...
# A tibble: 379 x 10
x.sorted tp fp tn fn tpr tnr fpr fnr ss
<dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Inf 0 0 29 349 0 1 0 1 1
2 6.32 1 0 29 348 0.00287 1 0 0.997 1.00
3 5.17 2 0 29 347 0.00573 1 0 0.994 1.01
4 4.88 3 0 29 346 0.00860 1 0 0.991 1.01
5 4.78 4 0 29 345 0.0115 1 0 0.989 1.01
6 4.47 5 0 29 344 0.0143 1 0 0.986 1.01
7 4.27 6 0 29 343 0.0172 1 0 0.983 1.02
8 4.12 7 0 29 342 0.0201 1 0 0.980 1.02
9 4.00 8 0 29 341 0.0229 1 0 0.977 1.02
10 3.85 9 0 29 340 0.0258 1 0 0.974 1.03
# ... with 369 more rows
... and the x.sorted with the highest ss value is different from the one calculated by cutpointr. What am I missing, please?
Consider the following code:
library(cutpointr)
library(doSNOW)
library(dplyr)
library(tidyr)
cl <- makeCluster(2) # 2 cores
registerDoSNOW(cl)
set.seed(101)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
silent = TRUE, allowParallel = TRUE)
opt_cut_b %>% select(boot) %>% unnest %>% head
This outputs a tibble with columns as expected:
# A tibble: 6 x 23
optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b sensitivity_oob
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 0.927 0.920 1.76 1.73 0.883 0.833 0.872 0.9
2 4 0.960 0.894 1.90 1.51 0.940 0.857 0.966 0.632
3 2 0.905 0.956 1.73 1.75 0.850 0.862 0.886 0.889
4 2 0.958 0.881 1.84 1.66 0.874 0.854 0.969 0.8
5 3 0.927 0.947 1.77 1.68 0.898 0.876 0.867 0.8
6 2 0.931 0.954 1.76 1.80 0.853 0.873 0.912 0.929
# … with 14 more variables: specificity_b <dbl>, specificity_oob <dbl>, kappa_b <dbl>, kappa_oob <dbl>, TP_b <dbl>,
# FP_b <dbl>, TN_b <int>, FN_b <int>, TP_oob <dbl>, FP_oob <dbl>, TN_oob <int>, FN_oob <int>, roc_curve_b <list>,
# roc_curve_oob <list>
Now at times with certain seeds:
set.seed(102)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
silent = TRUE, allowParallel = TRUE)
opt_cut_b %>% select(boot) %>% unnest
Notice how the columns are now holding <list>s of <dbl [1]>:
# A tibble: 100 x 23
optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b sensitivity_oob
<list> <dbl> <dbl> <list> <list> <lis> <list> <list> <list>
1 <dbl [1]> 0.945 0.875 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
2 <dbl [1]> 0.959 0.877 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
3 <dbl [1]> 0.937 0.905 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
4 <dbl [1]> 0.927 0.955 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
5 <dbl [1]> 0.965 0.849 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
6 <dbl [1]> 0.909 0.976 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
7 <dbl [1]> 0.960 0.830 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
8 <dbl [1]> 0.908 0.959 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
9 <dbl [1]> 0.935 0.958 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
10 <dbl [1]> 0.891 0.951 <dbl [1]> <dbl [1]> <dbl… <dbl [… <dbl [1]> <dbl [1]>
# … with 90 more rows, and 14 more variables: specificity_b <list>, specificity_oob <list>, kappa_b <list>,
# kappa_oob <list>, TP_b <list>, FP_b <list>, TN_b <list>, FN_b <list>, TP_oob <list>, FP_oob <list>, TN_oob <list>,
# FN_oob <list>, roc_curve_b <list>, roc_curve_oob <list>
This does not seem to occur with allowParallel = FALSE.
Sorry, but I'm quite a newbie here and I don't know if this is really the right place to ask.
I wanted to plot the ROC curve with the cutpoint and a vertical and a horizontal line at the cutpoint.
In addition, I wanted to change the appearance of the cutpoint and the lines (ROC, vline, hline) with regard to type, size/width, and color. I also wanted to print the AUC value and the cutpoint values into the graph.
However, I failed, and I finally have no idea what to do with the tibbles. Can anyone help me, please?
Many thanks in advance!
Jo1511
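A sketch of a manual plot built from the tibbles, assuming the roc_curve column layout shown earlier in this thread (tpr and fpr columns):

```r
library(cutpointr)
library(ggplot2)
data(suicide)

opt_cut <- cutpointr(suicide, dsi, suicide)
rc <- opt_cut$roc_curve[[1]]  # one row per candidate cutpoint, with tpr/fpr

ggplot(rc, aes(x = fpr, y = tpr)) +
  geom_line(color = "steelblue", size = 1) +
  # dashed guide lines at the optimal cutpoint's operating point
  geom_vline(xintercept = 1 - opt_cut$specificity, linetype = "dashed") +
  geom_hline(yintercept = opt_cut$sensitivity, linetype = "dashed") +
  annotate("text", x = 0.75, y = 0.25,
           label = paste0("AUC = ", round(opt_cut$AUC, 3),
                          "\nCutpoint = ", opt_cut$optimal_cutpoint)) +
  theme_minimal()
```

Colors, line types, and sizes can then be adjusted through the usual ggplot2 arguments.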
Hi, I would like to know if you have any plans to make a function to return multiple cutpoints instead of one cutpoint?
What would your approach be in selecting multiple cutpoints?
Hi Christian,
Does cutpointr support more than one predictor? In this dataset, for example, could I use dsi + age + gender as predictors together (not individually)? Then the cutpoint won't be the optimal dsi but the probability for assigning the positive class.
opt_cut <- cutpointr(data = suicide, x = dsi, class = suicide,
pos_class = 'yes', neg_class = 'no',
direction = '>=', boot_runs = 100)
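cutpointr itself optimizes a cutpoint on a single numeric vector, so a common pattern (a sketch, not an official package feature) is to combine the predictors in a logistic regression first and then find a cutpoint on the predicted probabilities:

```r
library(cutpointr)
data(suicide)

# Combine several predictors into one score via logistic regression
mod <- glm(suicide ~ dsi + age + gender, family = binomial, data = suicide)
suicide$prob <- predict(mod, type = "response")  # predicted P(positive class)

opt_cut <- cutpointr(suicide, prob, suicide,
                     pos_class = "yes", neg_class = "no", direction = ">=")
```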
Hi,
I am using your package to check my quick implementation of the dice/f1-score.
When I use the cutpointr() function and put the displayed optimal threshold (0.007) into my dice function, I get a different result (0.6586022) than the one the cutpointr() function displays (0.672).
Do you work with approximations or is this a bug in my or your code?
# libraries
library(magrittr)
library(tibble)
#> Warning: Paket 'tibble' wurde unter R Version 3.5.1 erstellt
library(cutpointr)
#> Warning: Paket 'cutpointr' wurde unter R Version 3.5.1 erstellt
# testdata
set.seed(123)
df_test <- tibble(pred = runif(1000,0,1),
resp = sample(c(0,1), size = 1000, replace = TRUE))
# functions
binary_outcome <- function(x, thr){
x[x >= thr] <- 1L
x[x < thr] <- 0L
x
}
true_positive <- function(bo, label){
bo <- as.integer(bo); label <- as.integer(label)
sum(bo == 1L & label == 1L)
}
false_positive <- function(bo, label){
bo <- as.integer(bo); label <- as.integer(label)
sum(bo == 1L & label == 0L)
}
true_negative <- function(bo, label){
bo <- as.integer(bo); label <- as.integer(label)
sum(bo == 0L & label == 0L)
}
false_negative <- function(bo, label){
bo <- as.integer(bo); label <- as.integer(label)
sum(bo == 0L & label == 1L)
}
dice <- function(pred, label, thr){
label <- as.integer(label)
bo <- binary_outcome(pred, thr)
tp <- true_positive(bo, label)
fp <- false_positive(bo, label)
tn <- true_negative(bo, label)
fn <- false_negative(bo, label)
2 * tp /(2 * tp + fn + fp)
}
# Test
cp <- cutpointr(data = df_test, x = pred, class = resp,
method = maximize_metric, metric = F1_score)
#> Assuming the positive class is 0
#> Assuming the positive class has higher x values
cp
#> # A tibble: 1 x 16
#> direction optimal_cutpoint method F1_score acc sensitivity
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 >= 0.00700 maximize_metric 0.672 0.509 0.998
#> specificity AUC pos_class neg_class prevalence outcome predictor
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 0.0141 0.521 0 1 0.503 resp pred
#> data roc_curve boot
#> <list> <list> <lgl>
#> 1 <tibble [1,000 x 2]> <data.frame [1,001 x 10]> NA
dice(pred = df_test$pred, label = df_test$resp, thr = 0.00700)
#> [1] 0.6586022
# including the precise cutpoint
dice(pred = df_test$pred, label = df_test$resp, thr = cp$optimal_cutpoint)
#> [1] 0.6581598
Created on 2018-07-20 by the reprex package (v0.2.0).
Dear Christian @Thie1e
I ran cutpointr with subgroup and bootstrap. Then I drew lines corresponding to each subgroup. Because one of the lines is my reference, I'd like to set a color for only this line. To illustrate what I'm trying to achieve, I use the sample dataset as follows. As shown, I am able to manually set two colors for gender. However, what I really want is to keep the default colors for all lines except my reference line. The reason is that I will have many lines (subgroups) and I'd like to see my reference line clearly among the rest of the lines.
Any advice on how I can achieve this?
Thanks much.
library(cutpointr)
library(dplyr)
library(ggplot2)
data(suicide)
opt_cut <- cutpointr(data=suicide, x=dsi, class=suicide, direction = ">=", pos_class = "yes",
neg_class = "no", subgroup = gender,
method = maximize_metric, metric = youden,
boot_runs = 100) %>%
add_metric(list(cohens_kappa))
plot_cutpointr(opt_cut, xvar = cutpoint,
yvar = cohens_kappa,
conf_lvl = 0.95,
aspect_ratio = NULL) +
scale_x_continuous(n.breaks=20, minor_breaks = waiver()) +
scale_y_continuous(n.breaks=5, minor_breaks = waiver()) +
scale_color_manual(values = c("#353436",
"#02e302")) +
scale_fill_manual(values = c("#353436",
"#02e302"))
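One way to keep the default palette for all subgroups while overriding only the reference line (a sketch; the level names and highlight color are placeholders):

```r
library(scales)

# Reproduce ggplot2's default hue palette for the number of subgroups,
# then overwrite only the reference level's color by name.
cols <- hue_pal()(2)
names(cols) <- c("female", "male")
cols["male"] <- "#02e302"  # highlight the reference subgroup

# ... then add to the plot:
#   + scale_color_manual(values = cols) + scale_fill_manual(values = cols)
```

With many subgroups, hue_pal()(length(levels)) generates the full default palette, so only the one named entry needs to be changed.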