hofnerb / stabs

Stability Selection with Error Control

Home Page: https://cran.r-project.org/package=stabs

R 99.07% Shell 0.93%
stability-selection cran r-package r-language variable-selection variable-importance machine-learning resampling

stabs's Introduction

stabs


stabs implements resampling procedures to assess the stability of selected variables with additional finite sample error control for high-dimensional variable selection procedures such as Lasso or boosting. Both standard stability selection (Meinshausen & Bühlmann, 2010, doi:10.1111/j.1467-9868.2010.00740.x) and complementary pairs stability selection with improved error bounds (Shah & Samworth, 2013, doi:10.1111/j.1467-9868.2011.01034.x) are implemented. The package can be combined with arbitrary user-specified variable selection approaches.
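Stability selection repeatedly applies a variable selection method to random subsamples of the data and keeps the variables whose selection frequency exceeds a cutoff; this bounds the per-family error rate (PFER), i.e., the expected number of falsely selected variables. As a sketch of how the tuning quantities relate under the Meinshausen & Bühlmann bound (two of cutoff, q, and PFER determine the third; stabsel_parameters() performs the computation):

library("stabs")
## with p = 200 variables, q = 10 variables selected per subsample, and a
## cutoff of 0.75, the PFER is bounded by q^2 / ((2 * cutoff - 1) * p)
stabsel_parameters(p = 200, q = 10, cutoff = 0.75, sampling.type = "MB")
## the bound computed by hand:
10^2 / ((2 * 0.75 - 1) * 200)  ## = 1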

For an expanded and executable version of this file, please see

vignette("Using_stabs", package = "stabs")

Installation

  • Current version (from CRAN):
install.packages("stabs")
  • Latest development version from GitHub:
library("devtools")
install_github("hofnerb/stabs")

To use the install_github() command, one first needs to install devtools:

install.packages("devtools")

Using stabs

A simple example of how to use stabs with package lars:

library("stabs")
library("lars")
## make data set available
data("bodyfat", package = "TH.data")
## set seed
set.seed(1234)

## lasso
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                       fitfun = lars.lasso, cutoff = 0.75,
                       PFER = 1))

## stepwise selection
(stab.stepwise <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                          fitfun = lars.stepwise, cutoff = 0.75,
                          PFER = 1))

## plot results
par(mfrow = c(2, 1))
plot(stab.lasso, main = "Lasso")
plot(stab.stepwise, main = "Stepwise Selection")

We can see that stepwise selection seems to be quite unstable, even in this low-dimensional example!

User-specified variable selection approaches

To use stabs with user-specified functions, one can supply an own fitfun. These functions need to take the arguments x (the predictors), y (the outcome), and q (the number of selected variables, as defined for stability selection). Additional arguments to the variable selection method can be handled via .... In stabsel() these can then be specified as a named list passed to args.fitfun.

The fitfun function then needs to return a named list with two elements, selected and path:

  • selected is a logical vector that indicates which variables were selected.
  • path is a matrix that indicates which variable was selected in which step: each row represents one variable, the columns represent the steps. path is optional and only needed to draw the complete selection paths. A minimal sketch of such a fitfun is given below.
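For illustration, here is a hypothetical fitfun, not part of the package, that simply selects the q predictors with the largest absolute marginal correlation with the outcome:

cor.screen <- function(x, y, q, ...) {
    xm <- as.matrix(x)
    ## rank predictors by absolute marginal correlation with y
    sel <- order(abs(cor(xm, y)), decreasing = TRUE)[seq_len(q)]
    ret <- logical(ncol(xm))
    ret[sel] <- TRUE
    names(ret) <- colnames(xm)
    ## path is optional and hence omitted here
    list(selected = ret)
}

This could then be plugged in directly, e.g., via stabsel(x = bodyfat[, -2], y = bodyfat[, 2], fitfun = cor.screen, cutoff = 0.75, PFER = 1).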

The following example shows how lars.lasso is implemented:

lars.lasso <- function(x, y, q, ...) {
    if (!requireNamespace("lars"))
        stop("Package ", sQuote("lars"), " needed but not available")

    if (is.data.frame(x)) {
        message("Note: ", sQuote("x"),
                " is coerced to a model matrix without intercept")
        x <- model.matrix(~ . - 1, x)
    }

    ## fit model
    fit <- lars::lars(x, y, max.steps = q, ...)

    ## which coefficients are non-zero?
    selected <- unlist(fit$actions)
    ## check if variables are removed again from the active set
    ## and remove these from selected
    if (any(selected < 0)) {
        idx <- which(selected < 0)
        idx <- c(idx, which(selected %in% abs(selected[idx])))
        selected <- selected[-idx]
    }

    ret <- logical(ncol(x))
    ret[selected] <- TRUE
    names(ret) <- colnames(x)
    ## compute selection paths
    cf <- fit$beta
    sequence <- t(cf != 0)
    ## return both
    return(list(selected = ret, path = sequence))
}

To see more examples, simply print, e.g., lars.stepwise, glmnet.lasso, or glmnet.lasso_maxCoef. Please contact me if you need help integrating your method of choice.
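Additional arguments for the fitfun are supplied via args.fitfun, as mentioned above. A sketch reusing the bodyfat data from the first example (family is an argument of glmnet, passed through by glmnet.lasso):

stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[, 2],
                       fitfun = glmnet.lasso,
                       args.fitfun = list(family = "gaussian"),
                       cutoff = 0.75, PFER = 1)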

Using boosting with stability selection

Instead of specifying a fitting function, one can also use stabsel directly on fitted boosting models from mboost.

library("stabs")
library("mboost")
### low-dimensional example
mod <- glmboost(DEXfat ~ ., data = bodyfat)

## compute cutoff ahead of running stabsel to see if it is a sensible
## parameter choice.
##   p = number of variables = ncol(bodyfat) - 1 (the outcome) + 1 (the intercept)
stabsel_parameters(q = 3, PFER = 1, p = ncol(bodyfat) - 1 + 1,
                   sampling.type = "MB")
## the same:
stabsel(mod, q = 3, PFER = 1, sampling.type = "MB", eval = FALSE)

## now run stability selection
(sbody <- stabsel(mod, q = 3, PFER = 1, sampling.type = "MB"))
opar <- par(mai = par("mai") * c(1, 1, 1, 2.7))
plot(sbody, type = "paths")
par(opar)

plot(sbody, type = "maxsel", ymargin = 6)

Citation

To cite the package in publications, please use

citation("stabs")

which will currently give you

To cite package 'stabs' in publications use:

  Benjamin Hofner and Torsten Hothorn (2021). stabs: Stability
  Selection with Error Control, R package version 0.6-4,
  https://CRAN.R-project.org/package=stabs.

  Benjamin Hofner, Luigi Boccuto and Markus Goeker (2015). Controlling
  false discoveries in high-dimensional situations: Boosting with
  stability selection. BMC Bioinformatics, 16:144.
  doi:10.1186/s12859-015-0575-3
  
To cite the stability selection for 'gamboostLSS' models use:

  Thomas, J., Mayr, A., Bischl, B., Schmid, M., Smith, A., and
  Hofner, B. (2017). Gradient boosting for distributional regression -
  faster tuning and improved variable selection via noncyclical updates.
  Statistics and Computing. Online First. doi:10.1007/s11222-017-9754-6

To extract BibTeX references use

toBibtex(citation("stabs"))

stabs's People

Contributors

gokceneraslan, hofnerb, richardbeare

stabs's Issues

Improve error handling (2)

library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "unimodal")
## Error in if (cutoff <= 3/4) { : missing value where TRUE/FALSE needed
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "r-concave")
## Error in minD(q, p, cutoff[i], B) : ‘q’ must be <= 9

In the latter case just set call. = FALSE.
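A sketch of that suggested fix, assuming the message is raised via stop():

stop(sQuote("q"), " must be <= 9", call. = FALSE)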

q = 1 -> more than one variable selected

I have a quick question:
I used stabsel in combination with glmnet.lasso and set q to 1.
However, the results show that more than one variable has been selected - how is that possible?

stabsel(x = data[predict], y = data$freq_calls, fitfun = glmnet.lasso,
        args.fitfun = list(family = "poisson"), PFER = 0.2, q = 1,
        sampling.type = "SS")

Stability Selection with unimodality assumption

Selected variables:
   A    E
   2    6

Selection probabilities:
   O   N.  Ag.   Ed    I    G    C    E    A
0.00 0.01 0.01 0.02 0.09 0.10 0.19 0.66 0.67

---
Cutoff: 0.65; q: 1; PFER (*):  0.192 
   (*) or expected number of low selection probability variables
PFER (specified upper bound):  0.2 
PFER corresponds to signif. level 0.0213 (without multiplicity adjustment)

Thank you in advance.

citation("stabs")

using citation("stabs") gives

Benjamin Hofner and Torsten Hothorn (2015). stabs: Stability Selection with Error Control, R package version R package version 0.5-1, http://CRAN.R-project.org/package=stabs.

Benjamin Hofner, Luigi Boccuto and Markus Goeker (2014). Controlling false discoveries in high-dimensional situations: Boosting with stability selection. arXiv:1411.1285. http://arxiv.org/abs/1411.1285.

which doubly contains "R package version" in the reference to the manual and does not mention the more recent version of the paper at e.g. http://www.ncbi.nlm.nih.gov/pubmed/25943565

Make new dependency checks happy

To make the new dependency checks on CRAN happy, follow this recipe (as suggested by Kurt Hornik).

1.) Copy the functions listed after Undefined global functions or variables: in the output of R CMD check into a variable txt:

txt <- "abline axis hcl matplot par plot points tail"

2.) With the function

imports_for_undefined_globals <- function(txt, lst, selective = TRUE) {
    if(!missing(txt))
        lst <- scan(what = character(), text = txt, quiet = TRUE)
    nms <- lapply(lst, find)
    ind <- sapply(nms, length) > 0L
    imp <- split(lst[ind], substring(unlist(nms[ind]), 9L))
    if(selective) {
        sprintf("importFrom(%s)",
                vapply(Map(c, names(imp), imp),
                       function(e)
                           paste0("\"", e, "\"", collapse = ", "),
                       ""))
    } else {
        sprintf("import(\"%s\")", names(imp))
    }
}

one can obtain the imports via

writeLines(imports_for_undefined_globals(txt))
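For the txt above, the output should look roughly as follows (a sketch; the exact grouping depends on the packages attached to the search path):

importFrom("grDevices", "hcl")
importFrom("graphics", "abline", "axis", "matplot", "par", "plot", "points")
importFrom("utils", "tail")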

3.) Copy and paste the output to NAMESPACE.
4.) Add the packages to Imports in DESCRIPTION.

Randomized Logistic Regression

Hi there,

Can I use the function 'stabsel' for multinomial classification after factorizing the class?

If so, how does this compare to 'RandomizedLogisticRegression' from sklearn?

Thank you!

Make vignette

  • Add vignette based on README.md
  • Add vignette showing how to use stabsel with mboost and gamboostLSS (the latter perhaps based on the paper?)

Fix issue when variables are dropped from active set in lars.lasso

library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "none")
## Error in ret[selected] <- TRUE :
##    only 0's may be mixed with negative subscripts

(error reported by Markus Scheibengraber)

No variables selected on high-dimensional data

I used the stabs package on high-dimensional gene data. Regardless of what I tried (lars.lasso would not work on my input, but glmnet.lasso works just fine), every time I run it the result is 'no variables selected'. For example,

stab_lasso <- stabsel(x = train_dat, y = tr_target, fitfun = glmnet.lasso,
                      cutoff = 0.75, PFER = 0.2)

My data has extremely high dimensionality: 6248 columns and only 224 rows.

If I use glmnet.lasso_maxCoef, the algorithm just returns the first 45 genes as selected, which is clearly not right.

What am I doing wrong? Has anyone tried this on real-world high-dimensional data, rather than only on low-dimensional data?

CRAN packages not using Suggests: packages conditionally

As §1.1.3.1 of the manual told you, packages in Suggests: should be used conditionally. The latest version of igraph will not install on Solaris (hence packages strictly depending on it), and as you can see from its CRAN checks page, your package now fails its check.

Please correct ASAP and definitely within a month.
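A minimal sketch of the conditional-use pattern CRAN asks for (a hypothetical example, not the actual change made in stabs):

if (requireNamespace("igraph", quietly = TRUE)) {
    ## only touch igraph functionality when the package is available
    g <- igraph::make_ring(5)
} else {
    message("Package ", sQuote("igraph"), " not available; skipping example")
}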

glmnet.lasso_maxCoef(): incorrect coefficient sorting? [bug]

Hi,
on line 129 of fitfuns.R, it looks like glmnet.lasso_maxCoef() is not sorting the coefficients in the right order. The code is:
selected <- order(coef(fit)[-1])[1:q]

Here is a reproducible example that would capture lines 126 and 129:

library(glmnet)
library(stabs)
set.seed(36)

data(QuickStartExample)
x <- QuickStartExample$x
y <- QuickStartExample$y

lambda.1se <- cv.glmnet(x, y)$lambda.1se # 0.1322761

fit <- glmnet(x, y, lambda = lambda.1se)
fit$df # 8

## coefficients minus the intercept
coef_vec <- coef(fit)[-1] 
coef_vec
# 1.3019591  0.0000000  0.6422423  0.0000000 -0.7892393  0.4944803  0.0000000  0.2943201  0.0000000  0.0000000  0.1058430
# 0.0000000  0.0000000 -1.0402308  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000 -0.9791165

##  line 129 :: order(coef(fit)[-1])[1:q]
i1 <- order(coef_vec); i1
# [1] 14 20  5  2  4  7  9 10 12 13 15 16 17 18 19 11  8  6  3  1
q <- 9; coef_vec[i1[1:q]]
# -1.0402308 -0.9791165 -0.7892393  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000

I would have expected the sorting to be done on the absolute value of the coefficients. Assuming this is the intent, two possible solutions are:

i2 <- order(abs(coef_vec), decreasing = TRUE); i2
# [1]  1 14 20  5  3  6  8 11  2  4  7  9 10 12 13 15 16 17 18 19
coef_vec[i2[1:q]]
# 1.3019591 -1.0402308 -0.9791165 -0.7892393  0.6422423  0.4944803  0.2943201  0.1058430  0.0000000

or

selected <- predict(fit, type = "nonzero")
selected <- selected[[length(selected)]] ## length is 1 when lambda is a scalar
selected
#  1  3  5  6  8 11 14 20

I believe the latter solution is used in glmnet.lasso() on lines 31 and 32 of fitfuns.R.

## verify that the same coefficients are chosen
all(selected %in% i2[1:fit$df])
# TRUE

Thanks
Peter

Stability selection [R] stabs::stabsel "Error in res[[1]] : subscript out of bounds"

Hello,
I'm analysing sequence data for my master's thesis, and I intend to use feature selection to find the most important features of my sequences for predicting my outcome (salary at age 30).
When I run the following code in RStudio I get an error:

Reglin_control_features_sala <- lm(t9salafteall_comp ~ Agglo + Sex + Migr_gen + Migr_reg + Langue_maison_dummy + Reg_ling, data=alldata_formation_salaires_NoNAs, weights = basewt)

set.seed(999)
y <- residuals(Reglin_control_features_sala, type = "deviance")
x <- Features_12mois_SD

prop_stabsel <- stabs::stabsel(x = x, y = y, fitfun = glmnet.lasso,
                               PFER = 0.5, cutoff = 0.6, B = 1000)
=> "Error in res[[1]] : subscript out of bounds"

I tried different values for the parameters PFER, cutoff and B, always with the same result. My professor, Matthias Studer, does not know where this error comes from either, which is why I am asking you directly.

Thanks a lot for your help,
Regards,

Leonhard.

Fix plot

If a matrix is used instead of a data frame, y-labs are wrong:

library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 5, fitfun = lars.lasso, assumption = "none")
plot(stabs)

## with data frame everything is fine:
x <- as.data.frame(x)
plot(stabsel(x, y, PFER = 1, q = 5, fitfun = lars.lasso, assumption = "none"))

no variables are selected

Hello hofnerb,
If I do not fix my lambda, the algorithm does not select any of the variables in my high-dimensional data. Here is the code and the outcome:

stab_lasso <- stabsel(x = train_dat, y = tr_target, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)
Stability Selection with unimodality assumption

No variables selected

Can I use stabsel on non-zero matrix after lasso?

Dear Hofner,

I have many large matrices (1000 obs × 15000 vars) for lasso and variable selection. To speed things up, I think it would be much faster to run stabsel() on data containing only the variables with non-zero lasso coefficients (columns of the x matrix with zero coefficients removed).

Is this reasonable? Running stabsel() on the full x matrix returns selection probabilities for all variables in x, while running it on the reduced x matrix is much faster. The ordering of the resulting selection probabilities seems consistent between x and the reduced x, but the values differ.

Here is my code:

library(glmnet)
library(stabs)

#data
set.seed(1234)
data("bodyfat", package = "TH.data")
x = as.matrix(bodyfat[, -2])
y = bodyfat[,2]

# use cv.glmnet to get the best lambda
cv.fit = cv.glmnet(x, y, alpha = 1)
cv.fit$lambda.min
res = glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = cv.fit$lambda.min)


# xuse: x restricted to variables with non-zero lasso coefficients
nonZero = which(res$beta != 0)
xuse = as.matrix(x[, nonZero]) 

#stabsel on xuse (zero coefficient columns removed)
set.seed(1234)
stab.lasso1 <- stabsel(x = xuse, y = y, fitfun = glmnet.lasso,
                       args.fitfun = list(family = "gaussian", lambda = cv.fit$lambda.min),
                       cutoff = 0.75, PFER = 1)

#stabsel on x
set.seed(1234)
stab.lasso2 <- stabsel(x = x, y = y, fitfun = glmnet.lasso,
                       args.fitfun = list(family = "gaussian", lambda = cv.fit$lambda.min),
                       cutoff = 0.75, PFER = 1)

stab.lasso1$phat
stab.lasso2$phat
stab.lasso2$phat[nonZero]


output:

stab.lasso1$phat
s0
waistcirc 1.00
hipcirc 1.00
kneebreadth 0.97
anthro3a 0.68
anthro3b 0.92
anthro3c 0.57

stab.lasso2$phat
s0
age 0.52
waistcirc 0.99
hipcirc 1.00
elbowbreadth 0.45
kneebreadth 0.94
anthro3a 0.57
anthro3b 0.85
anthro3c 0.56
anthro4 0.17

stab.lasso1$selected
waistcirc hipcirc kneebreadth anthro3b
1 2 3 5
stab.lasso2$selected
waistcirc hipcirc kneebreadth anthro3b
2 3 5 7

Running stabsel() on x or on the reduced x (xuse) gives the same selected variables. But is there any potential problem with running stabsel() on the reduced x?

Thank you!

Documentation for parLapply or sequential runs

Thanks for developing and maintaining the package. I was wondering if you have any recommendations for using parLapply as a parallel backend. I have noticed that mclapply sometimes raises an error, for example:

Error in `.check_ncores(cores)`: 3 simultaneous processes spawned

The error is hard to reproduce: sometimes it happens, sometimes it does not.
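A hedged sketch, assuming the papply argument of stabsel() (which defaults to mclapply): a simple workaround for mclapply problems is to run sequentially by passing lapply; a parLapply-based backend is also possible but, on a PSOCK cluster, requires exporting the needed objects and packages to the workers first.

## sequential evaluation with the asker's data, no forking involved:
stab <- stabsel(x = x, y = y, fitfun = glmnet.lasso,
                cutoff = 0.75, PFER = 1, papply = lapply)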

coefficients from each run

Is it possible to fetch the coefficients of the variables from each run? It is great to know the stable associations, and it would be even better to know the direction and degree of the relationship between the response and the variables.

Graphical models query

Hi,
Not sure if this is the forum you'd like to use for queries - let me know if it isn't.

I'm exploring approaches using the JGL package, specifically the fused group lasso. I'm likely to be working with two groups. I have the mechanisms in place to compute the two lambda values. The difference in partial correlation coefficient for corresponding graph edges is of interest. I have explored bootstrapping approaches to characterising this, but a stability selection approach looks interesting.

I'm unsure of how to use the q parameter in this setting. Do you have examples for glasso-like cases? I also need to be careful about how the resampling occurs within groups.

Thanks

glmnet.lasso_maxCoef() is missing a lambda argument? [bug]

Hi,
before I get into this issue report, I wanted to thank you for writing such a useful package :)

I have been using stabsel() with glmnet() and noticed the following on line 126 of glmnet.lasso_maxCoef:
fit <- glmnet::glmnet(x, y, ...)

glmnet.lasso_maxCoef() does not seem to be using the penalty parameter lambda= that is passed to it.

I was expecting line 126 to look like:

fit <- glmnet::glmnet(x, y, lambda=lambda, ...)

Assuming the lambda value passes the check on line 120, is it being captured by the ...? If not, this looks like a bug.

Thanks
Peter

PS: ?glmnet suggests using predict.glmnet() for single lambda values.
