ludvigolsen / cvms

R package: Cross-validate one or multiple Gaussian or binomial regression models at once. Perform repeated cross-validation. Returns results in a tibble for easy comparison, reporting, and further analysis.

License: Other

R 99.99% CSS 0.01%

cvms's Introduction

Welcome alien overlord! I'm Ludvig Olsen


I'm an R and Python developer with an MSc in Cognitive Science, working as a PhD student at the Department of Molecular Medicine (MOMA) at Aarhus University Hospital. My current projects revolve around the detection and localization of cancers from cell-free DNA.

I have worked with deep learning on text at UNSILO and medical imaging data at Cercare Medical.

The goal of my career is to have a very big positive impact on this world of ours! 🌍

R packages

I have developed a set of open source R packages. Feel free to check them out :-)


groupdata2

  • Divide data into groups
  • Create balanced partitions and cross-validation folds
  • Perform time series windowing
  • Balance existing groups with up- and downsampling
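As a minimal sketch of the folding workflow (using the built-in mtcars dataset purely for illustration; see the groupdata2 docs for the full interface):

```r
library(groupdata2)

# Balance the `am` classes across folds via `cat_col`
d <- mtcars
d$am <- factor(d$am)

set.seed(1)
folded <- fold(d, k = 4, cat_col = "am")

# A `.folds` factor column is appended; each level is one fold
table(folded$.folds)
```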
cvms

  • Cross-validate regression and classification models
  • Evaluate predictions with a tidy output
  • Extract challenging observations
  • Find baselines for a wide range of metrics
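A minimal cross-validation sketch (assuming the current argument names `formulas` and `family`; older versions used `models`):

```r
library(cvms)
library(groupdata2)

# participant.scores ships with cvms; balance folds by diagnosis and
# keep each participant's rows in a single fold
set.seed(1)
data <- fold(participant.scores, k = 4,
             cat_col = "diagnosis", id_col = "participant")

cv <- cross_validate(data, formulas = "diagnosis ~ score",
                     family = "binomial")

# One row per formula, with evaluation metrics as columns
cv$`Balanced Accuracy`
```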
xpectr

  • Generate testthat tests
  • RStudio addins for faster development
rearrr

  • Rearrrange data points
    - center max value, roll elements, shuffle hierarchy, ...
  • Mutate data points
    - rotate, swirl, cluster, roll values, ...
  • Scaling and measuring utilities
    - MinMax scaling, find angle, find centroid, ...


cvms's People

Contributors

eliocamp, indrajeetpatil, ludvigolsen, strengejacke


cvms's Issues

problem with confusion matrix plot when adding sum tiles.

Hi Ludvig,
The problem with the confusion matrix plot only occurs when I try to add sum tiles.

library(cvms)
library(broom)    # tidy()
library(tibble)   # tibble()
library(ggimage)  # used by plot_confusion_matrix() for arrows and zero-shading
#> Loading required package: ggplot2
library(rsvg)     # used by plot_confusion_matrix() for arrows and zero-shading

set.seed(1)
d_multi <- tibble("target" = floor(runif(100) * 3),
                  "prediction" = floor(runif(100) * 3))
conf_mat <- confusion_matrix(targets = d_multi$target,
                             predictions = d_multi$prediction)

plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]], add_sums = TRUE)
#> Error in plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]], add_sums = TRUE): unused argument (add_sums = TRUE)
plot_confusion_matrix(
  conf_mat$`Confusion Matrix`[[1]],
  add_sums = TRUE,
  sums_settings = sum_tile_settings(
    palette = "Oranges",
    label = "Total",
    tc_tile_border_color = "black"
  )
)
#> Error in plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]], add_sums = TRUE, : unused arguments (add_sums = TRUE, sums_settings = sum_tile_settings(palette = "Oranges", label = "Total", tc_tile_border_color = "black"))

Created on 2021-01-19 by the reprex package (v0.3.0)
Can you please help me here?

P.S: I am reproducing your code mentioned here

combine_predictors

max_fixed_effects = 1 should be allowed. Multiple dependent variables should also be supported.

combine_predictors("y2", c("x1", "x2"), max_fixed_effects = 1, max_interaction_size = 0)
## Error in combine_predictors("y2", c("x1", "x2"), max_fixed_effects = 1,  : 
## 1 assertions failed:
##  * Variable 'max_fixed_effects': Element 1 is not >= 2.
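For reference, a hedged sketch of the usage that currently works (two or more fixed effects; `dependent` and `fixed_effects` are the documented argument names):

```r
library(cvms)

# With >= 2 fixed effects, combine_predictors() returns a character
# vector of model formula strings combining the effects
formulas <- combine_predictors(dependent = "y2",
                               fixed_effects = c("x1", "x2"))
formulas
```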

Move CI to Github Actions

As all the free-tier Travis CI credits have been used, it seems necessary to move to a different solution. Try moving the CI to GitHub Actions instead. This might also be a good time to clean up the builds.

Name correction

Change name to "Hugh Benjamin" from "Benjamin Hugh" in DESCRIPTION

Inconsistency in nomenclature

Thanks for a super cool package.
I was working with evaluate() and plot_confusion_matrix() tonight, and I noticed some inconsistency in the arguments. It might be intentional for reasons I don't understand, but it would be useful to have either target_col or targets_col (and the same for predictions) in both functions.

evaluate(
  data,
  target_col,
  prediction_cols,
  type
)

plot_confusion_matrix(
  conf_matrix,
  targets_col = "Target",
  predictions_col = "Prediction",
  counts_col = "N"
)

Extract fitted models

Is there a way to extract the fitted models? I would like to test my model on data it hasn't seen, to get honest metrics, and I cannot find a way to do it.

Feature request: Only show data in sub_col on confusion matrix plot

Hi,

thank you for the nice package. It is really easy to work with and there are many nice features included.

I'm trying to use the plot_confusion_matrix() function. However, my counts differ a lot, so some small counts are not distinguishable in the color scheme. I fixed this by using logarithmic counts instead.
I was even able to show the original counts on the confusion matrix with the sub_col argument. However, I don't want to show the percentages, just the raw counts. When I set add_normalized to FALSE, I can only show the logarithmic counts and not the original ones. It would be really nice to have the option to show only the values in the sub_col argument on the plot.

Best,
Marie

Related to issue 35

Dear @LudvigOlsen ,
As of today, the following doesn't seem to be supported any more:

plot_confusion_matrix(
  conf_mat,
  font_row_percentages = font(prefix = c("NPV = ", "", "", "PPV = ")),
  font_col_percentages = font(prefix = c("Spec = ", "", "", "Sens = "))
)

The error messages are:

Error in element_text(size = size, color = color, face = face, family = family, :
unused argument (prefix = c("NPV=", "", "", "PPV="))

Error in element_text(size = size, color = color, face = face, family = family, :
unused argument (prefix = c("NPA=", "", "", "PPA="))

The error appears with ggplot2 3.4.4; the code works with ggplot2 3.4.3.

preinstalled interfaces

Suggest that the core packages identified in the CRAN Machine Learning Task View be provided out-of-the-box.

These are listed near the bottom of the Task View as Core and are abess, e1071, gbm, kernlab, mboost, nnet, randomForest, and rpart.

combine_predictors() does not generate all possible formulas

Currently, combine_predictors() only allows a fixed effect to be included in the formula once. As we use the asterisk "*" operator, this means that it can be included in one n-way interaction along with all its lower-order interactions (so x1 * x2 * x3 also yields x1 * x2, x1 * x3, and x2 * x3). But we can't have a formula like y ~ x1 * x2 + x1 * x3.

Another problem currently faced is that with more than 7 fixed effects, the function is very slow due to the many comparisons done when generating the formulas. I could maybe speed this up by using data.table instead of tidyverse for some parts of the code, but it would probably still be slow.

For the first problem, I have found a way to pairwise identify whether "terms" (broadly defined here as either a fixed effect or an n-way interaction) are allowed in the same formula. For instance, the two terms "x1 * x2" and "x1 * x2 * x3" should not be in the same formula, as the first is already generated automatically by including the second. The terms "x1 * x2" and "x1 * x3 * x4" could be in the same formula, because they include different effects.
With all the possible combinations of terms this isn't that fast though, so I intend to do the computation once for a big number of fixed effects and save it with the package, along with some descriptors that allow us to filter it. Then, the generation can be done from this table, and we can just switch the names of the fixed effects to the ones supplied by the user.
It might make more sense to save the actual formulas (again with info for filtering). That could make it very fast to use the function.

If loading the big table is too slow, I guess we can make smaller versions as well for those purposes.

Note: I still have to figure out how to generate the formulas from the table, so everything will likely change. ;)
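The pairwise compatibility check described above can be sketched in base R (a hypothetical `compatible()` helper for illustration, not part of the package):

```r
# Split a term like "x1 * x2" into its fixed effects
effects_of <- function(term) strsplit(term, " \\* ")[[1]]

# Two terms can coexist in a formula unless one's effect set is a
# subset of the other's (the `*` operator already generates the
# lower-order term in that case)
compatible <- function(a, b) {
  ea <- effects_of(a)
  eb <- effects_of(b)
  !(all(ea %in% eb) || all(eb %in% ea))
}

compatible("x1 * x2", "x1 * x2 * x3")  # FALSE: implied by the larger term
compatible("x1 * x2", "x1 * x3 * x4")  # TRUE: different effect sets
```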

replacing `broom` tidiers with `parameters` to reduce no. of dependencies

Before making a PR related to this, I was wondering if you would be open to this. If you agree, I will open a PR.

rationale

parameters (https://easystats.github.io/parameters/) has far fewer dependencies and can handle pretty much every model that broom and broom.mixed combined support. It also offers a number of additional features not in broom (e.g., robust SEs, standardization, etc.).

dependency calculations

tools::package_dependencies("broom")
#> $broom
#>  [1] "backports" "dplyr"     "ellipsis"  "generics"  "glue"      "methods"  
#>  [7] "purrr"     "rlang"     "stringr"   "tibble"    "tidyr"

tools::package_dependencies("broom.mixed")
#> $broom.mixed
#>  [1] "broom"    "dplyr"    "tidyr"    "plyr"     "purrr"    "tibble"  
#>  [7] "reshape2" "nlme"     "methods"  "stringr"  "coda"     "TMB"     
#> [13] "cubelyr"

tools::package_dependencies("parameters")
#> $parameters
#> [1] "bayestestR" "insight"    "methods"    "stats"      "utils"

example with merMod

library(lme4)
#> Loading required package: Matrix
library(magrittr)
library(parameters)

lmer_mod <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

broom.mixed::tidy(lmer_mod, effects = "fixed")
#> # A tibble: 2 x 5
#>   effect term        estimate std.error statistic
#>   <chr>  <chr>          <dbl>     <dbl>     <dbl>
#> 1 fixed  (Intercept)    251.       6.82     36.8 
#> 2 fixed  Days            10.5      1.55      6.77

parameters::standardize_names(parameters::model_parameters(lmer_mod), style = "broom") %>%
  tibble::as_tibble()
#> # A tibble: 2 x 9
#>   term  estimate std.error conf.level conf.low conf.high statistic df.error
#>   <chr>    <dbl>     <dbl>      <dbl>    <dbl>     <dbl>     <dbl>    <int>
#> 1 (Int…    251.       6.82       0.95   238.       265.      36.8       174
#> 2 Days      10.5      1.55       0.95     7.44      13.5      6.77      174
#> # … with 1 more variable: p.value <dbl>

example with lm

lm_mod <- lm(Reaction ~ Days, sleepstudy)

broom::tidy(lm_mod)
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    251.       6.61     38.0  2.16e-87
#> 2 Days            10.5      1.24      8.45 9.89e-15

parameters::standardize_names(parameters::model_parameters(lm_mod), style = "broom") %>%
  tibble::as_tibble()
#> # A tibble: 2 x 9
#>   term  estimate std.error conf.level conf.low conf.high statistic df.error
#>   <chr>    <dbl>     <dbl>      <dbl>    <dbl>     <dbl>     <dbl>    <int>
#> 1 (Int…    251.       6.61       0.95   238.       264.      38.0       178
#> 2 Days      10.5      1.24       0.95     8.02      12.9      8.45      178
#> # … with 1 more variable: p.value <dbl>

Created on 2021-02-18 by the reprex package (v1.0.0)

Confusion matrix issue (It's that time of the year)

I've had a few students running logistic regressions with code that otherwise has run previously and getting a:

Error in value[3L] :
Confusion matrix error: Error in [.default(data, , positive): subscript out of bounds

Any update to caret that hasn't been taken into account? (I'll make a reproducible example tonight)

Supplied Probabilities

I'm curious if this statement from the documentation is true? (The added emphasis is mine.)

  • "One column with the probability of class being the second class alphabetically."

Could it be that what actually matters is the second level when the target column is a factor?
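The distinction matters whenever the factor's level order is not alphabetical, as this base-R snippet illustrates:

```r
# A factor whose level order differs from the alphabetical order
targets <- factor(c("dog", "cat", "dog", "cat"), levels = c("dog", "cat"))

sort(levels(targets))[2]  # second class alphabetically: "dog"
levels(targets)[2]        # second factor level: "cat"
```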

Simpler layout for confusion matrix plots

First off, this is a great package and thanks for all the work!

I'd like to pitch a feature request for the plot_confusion_matrix() function. As currently implemented, it is only possible to add row and column percentages to every tile. While in some instances it may be useful to have these metrics for each tile, it leaves quite a bit of clutter. I suspect that for most people, restricting the row and column percentages to be displayed only in the diagonal tiles would provide the most essential information: errors of omission/commission for each class (or rather, 100% minus those errors).

Confusion matrix error: Error: `data` and `reference` should be factors with the same levels.

Hej Ludvig,
cvms suddenly stopped working on the following script, with the following data. It worked the last time I ran it in November, and now it gives a confusion matrix error. Any quick fix?

DataLaughLudvig.txt

Data <- read_delim("DataLaughLudvig.txt", delim = "\t")

Data$Truth <- as.factor(Data$Truth)

Data <- fold(Data, k = 5) %>%
  arrange(.folds)

models <- c("Truth ~ 1",
            "Truth ~ 1 + MeanInterVoicingInterval",
            "Truth ~ 1 + MeanInterVoicingInterval + HNRIQR",
            "Truth ~ 1 + MeanInterVoicingInterval + HNRIQR + HNRMedian",
            "Truth ~ 1 + MeanInterVoicingInterval + HNRIQR + HNRMedian + PitchLogMedian",
            "Truth ~ 1 + MeanInterVoicingInterval + HNRIQR + PitchLogMedian")

CV2 <- cross_validate(Data, models,
                      folds_col = '.folds',
                      family = 'binomial')

Issues with the next version of ggplot2

Hi

We are preparing the next release of ggplot2, and our reverse-dependency tests show an issue with cvms. The issue revolves around the deprecation of turning off a guide with guides(<aes> = FALSE). Internally, the guide is now set to "none" instead of FALSE, which breaks this unit test:

expect_true(!p1$guides$fill[[1]])

since ! cannot be applied to "none".

We plan to release ggplot2 next week

best
Thomas

Cross fold issue

I have just checked your post. I might be using your package soon and wanted to give it a go, but...
I have two problems.
You are missing the line where you load the data:
data <- participant.scores
I suppose this is the data used...?

Also, when I run
data <- fold(data, k = 4, cat_col = 'diagnosis', id_col = 'participant', num_fold_cols = 3)
I get:
Error in fold(data, k = 4, cat_col = "diagnosis", id_col = "participant",  : 
  unused argument (num_fold_cols = 3)

It is a fresh installation...:

> packageVersion("cvms")
[1] ‘0.0.0.9000’
> packageVersion("groupdata2")
[1] ‘1.0.0’

Add labels to some percentage numbers

Hi,

plot_confusion_matrix() is a great way to display Sensitivity, Specificity, PPV, NPV, etc. Is there a way to put a label (name) in front of the numbers to make the figure self-explanatory? For example, Sens 89.0%, PPV 91.1%, etc.? It seems the plot itself has plenty of white space to accommodate the labels.

Thanks much.

Grand total is empty when add_normalized is set to FALSE

Dear @LudvigOlsen,

The grand total is disabled when add_normalized = FALSE. I think the grand total should still be displayed, because add_normalized is about percentages, not counts. Does that make sense?

Here is an example:

# Create targets and predictions data frame
data <- data.frame(
  "target" = c("A", "B", "A", "B", "A", "B", "A", "B",
               "A", "B", "A", "B", "A", "B", "A", "A"),
  "prediction" = c("B", "B", "A", "A", "A", "B", "B", "B",
                   "B", "B", "A", "B", "A", "A", "A", "A"),
  stringsAsFactors = FALSE
)

# Evaluate predictions and create confusion matrix
eval <- evaluate(
  data = data,
  target_col = "target",
  prediction_cols = "prediction",
  type = "binomial"
)

eval

# Or plot the first confusion matrix in the evaluate() output
plot_confusion_matrix(eval,
                      add_normalized = FALSE,
                      add_sums = TRUE)

Error in fitting glmer models

I tried fitting several (silly) models with cross_validate_list, but the function would stop instead of just returning NAs for the failing model. I added tryCatch to each of the fit_model_ parts, which fixed it.

# My version of fit_model_:
# checks the model_type and fits the model on the training_set.
# Note: lm() and glm() live in stats, not lme4.
fit_model_ <- function(model, model_type, training_set, family, REML, model_verbose){

  if (model_type == 'lm'){

    if (isTRUE(model_verbose)) print('Used lm()')

    # Fit the model using lm(); return it, or NULL on error
    return(tryCatch({
      stats::lm(model, training_set)
    }, error = function(e){
      print("Model returned error.")
      return(NULL)
    }))

  } else if (model_type == 'lmer'){

    if (isTRUE(model_verbose)) print('Used lme4::lmer()')

    # Fit the model using lmer(); return it, or NULL on error
    return(tryCatch({
      lme4::lmer(model, training_set, REML = REML)
    }, error = function(e){
      print("Model returned error.")
      return(NULL)
    }))

  } else if (model_type == 'glm'){

    if (isTRUE(model_verbose)) print('Used glm()')

    # Fit the model using glm(); return it, or NULL on error
    return(tryCatch({
      stats::glm(model, training_set, family = family)
    }, error = function(e){
      print("Model returned error.")
      return(NULL)
    }))

  } else if (model_type == 'glmer'){

    if (isTRUE(model_verbose)) print('Used lme4::glmer()')

    # Fit the model using glmer(); return it, or NULL on error
    return(tryCatch({
      lme4::glmer(model, training_set, family = family)
    }, error = function(e){
      print("Model returned error.")
      return(NULL)
    }))
  }
}

Multi-class confusion matrix with 0 values

Hej L,
I'm using your functions to visualize a multiclass confusion matrix and it gives me issues when some of the categories are never predicted for some of the targets.
Example data: https://www.dropbox.com/s/wc1ytv1ro9kyxow/predictions.csv?dl=0

"My" code (straight from the vignette):

conf_mat <- confusion_matrix(targets = Predictions$Reference,
                             predictions = Predictions$Prediction)
plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]])

The warnings:
1: In plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]]) :
  'ggimage' is missing. Will not plot arrows and zero-shading.
2: In plot_confusion_matrix(conf_mat$`Confusion Matrix`[[1]]) :
  'rsvg' is missing. Will not plot arrows and zero-shading.

The plot: https://www.dropbox.com/s/cr6n0c7rcsv1ik6/confmat.jpeg?dl=0

Poisson regression, pretty please :-)

Probably my second-to-last issue (I'll try the batch function tomorrow). What about including Poisson regression (the outcome is a count variable)? It's the third most common form of regression and relatively easy to implement: it works just like linear regression. The main difference is that the estimates have to be exponentiated (exp(x)) to be interpreted, but that is of no relevance to the cross-validation. Performance is still measured as RMSE. A Poisson regression tutorial (not using glmer): http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm
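As a base-R sketch of why Poisson regression would slot into the same cross-validation loop (hypothetical simulated data, not cvms code):

```r
# Fit a Poisson GLM on a training split and score held-out data with RMSE
set.seed(1)
d <- data.frame(x = rnorm(60))
d$y <- rpois(60, lambda = exp(0.5 + 0.8 * d$x))

train <- d[1:45, ]
test  <- d[46:60, ]

fit <- glm(y ~ x, data = train, family = poisson)

# type = "response" exponentiates the linear predictor, so no manual exp()
pred <- predict(fit, newdata = test, type = "response")
rmse <- sqrt(mean((test$y - pred)^2))
rmse
```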

support "rectangular" or "asymmetrical" confusion matrices

Hi,

I love the package, I was wondering if you have any suggestions on how to plot a confusion matrix that is asymmetrical.

Imagine a case in which I have more targets than predictions (4x3), currently, the figure plotted will be 4x4 but with one row greyed out.


Instead, I would like to see what the model's predictions are for the out-of-distribution class (the 4th target). Is this possible?

In my mind, the easiest way would be to plot the actual 4x3 matrix (with the 4th class missing from the columns but present in the rows).

thank you

Grid search?

In the (awesome) readme it says something about grid search, but I cannot find a function to do so. Is it something you have to implement yourself, or is there some nifty trick I have overlooked? :))

Btw, totally loving the package - especially how well it works with repeated measures 👍

default hyperparameters

If I write the model_fn as shown in the documentation, then no matter what I try, it won't use the default hyperparameter and instead gives an error.

library(cvms)
library(dplyr)
library(groupdata2)

model_fn_loess <- function(train_data, formula, hyperparameters) {
  hyperparameters <- cvms::update_hyperparameters(span = 0.75, 
    hyperparameters = hyperparameters)
  loess(formula = formula, data = train_data, span = hyperparameters[["span"]])
}
predict_loess <- function(test_data, model, formula, hyperparameters, train_data) {
  predict(model, test_data)
}

fold_data <- fold(mtcars, k = 4) %>% arrange(.folds)
formulas <- c("mpg ~ .", "mpg ~ cyl + disp + hp")

# attempt 1
outRF <- cross_validate_fn(fold_data, formulas, "gaussian", 
  model_fn = model_fn_loess, predict_fn = predict_loess)
## Will cross-validate 2 models. This requires fitting 8 model instances.
## Error in value[[3L]](cond) : ---
## cross_validate_fn(): Error: Assertion failed. One of the following must apply:
##  * checkmate::check_list(hyperparameters): Must be of type 'list', not
##  * 'NULL'
##  * checkmate::check_data_frame(hyperparameters): Must be of type
##  * 'data.frame', not 'NULL'

# attempt 2
outRF <- cross_validate_fn(fold_data, formulas, "gaussian", 
  model_fn = model_fn_loess, predict_fn = predict_loess,
  hyperparameters = list())
## Error: Assertion failed. One of the following must apply:
##  * checkmate::check_data_frame(hyperparameters): Must be of type
##  * 'data.frame' (or 'NULL'), not 'list'
##  * checkmate::check_list(hyperparameters): Must have length >= 1, but
##  * has length 0

# attempt 3
outRF <- cross_validate_fn(fold_data, formulas, "gaussian", 
  model_fn = model_fn_loess, predict_fn = predict_loess,
  hyperparameters = data.frame())
## Error: Assertion failed. One of the following must apply:
##  * checkmate::check_data_frame(hyperparameters): Must have at least 1
##  * rows, but has 0 rows
##  * checkmate::check_list(hyperparameters): Must be of type 'list' (or
##  * 'NULL'), not 'data.frame'

Writing the model_fn like this does work but I think it should not be necessary.

model_fn_loess <- function(train_data, formula, hyperparameters) {
  hyperparameters <- if (missing(hyperparameters)) list(span = 0.75)
  else cvms::update_hyperparameters(span = 0.75, 
    hyperparameters = hyperparameters)
  loess(formula = formula, data = train_data, span = hyperparameters[["span"]])
}
