stan-dev / rstanarm Goto Github PK

rstanarm R package for Bayesian applied regression modeling

License: GNU General Public License v3.0

R 84.67% Shell 0.03% Stan 12.88% CSS 0.29% HTML 1.57% C++ 0.20% M4 0.12% TeX 0.23%

stan multilevel-models r-package bayesian bayesian-inference bayesian-methods bayesian-statistics bayesian-data-analysis statistical-modeling r

rstanarm's Introduction

rstanarm

Bayesian applied regression modeling (arm) via Stan

This is an R package that emulates other R model-fitting functions but uses Stan (via the rstan package) for the back-end estimation. The primary target audience is people who would be open to Bayesian inference if using Bayesian software were easier but would use frequentist software otherwise.

Fitting models with rstanarm is also useful for experienced Bayesian software users who want to take advantage the pre-compiled Stan programs that are written by Stan developers and carefully implemented to prioritize numerical stability and the avoidance of sampling problems.

Click the arrows for more details:

More detail

The rstanarm package is an appendage to the rstan package, the R interface to Stan. rstanarm enables many of the most common applied regression models to be estimated using Markov Chain Monte Carlo, variational approximations to the posterior distribution, or optimization. The package allows these models to be specified using the customary R modeling syntax (e.g., like that of glm with a formula and data.frame). Additional arguments are provided for specifying prior distributions.

The set of models supported by rstanarm is large (and will continue to grow), but also limited enough so that it is possible to integrate them tightly with the pp_check function for graphical posterior predictive checks using bayesplot and the posterior_predict function to easily estimate the effect of specific manipulations of predictor variables or to predict the outcome in a training set.

The fitted model objects returned by the rstanarm modeling functions are called stanreg objects. In addition to all of the traditional methods defined for fitted model objects, stanreg objects can also be used with the loo package for leave-one-out cross-validation, model comparison, and model weighting/averaging and the shinystan package for exploring the posterior distribution and model diagnostics with a graphical user interface.

Check out the rstanarm vignettes for examples and more details about the entire process.

Modeling functions

The model estimating functions are described in greater detail in their individual help pages and vignettes. Here we provide a very brief overview:

stan_lm, stan_aov,stan_biglm

Similar to lm and aov but with novel regularizing priors on the model parameters that are driven by prior beliefs about R-squared, the proportion of variance in the outcome attributable to the predictors in a linear model.
stan_glm, stan_glm.nb

Similar to glm but with various possible prior distributions for the coefficients and, if applicable, a prior distribution for any auxiliary parameter in a Generalized Linear Model (GLM) that is characterized by a family object (e.g. the shape parameter in Gamma models). It is also possible to estimate a negative binomial model similar to the glm.nb function in the MASS package.
stan_glmer, stan_glmer.nb, stan_lmer

Similar to the glmer, glmer.nb, and lmer functions (lme4 package) in that GLMs are augmented to have group-specific terms that deviate from the common coefficients according to a mean-zero multivariate normal distribution with a highly-structured but unknown covariance matrix (for which rstanarm introduces an innovative prior distribution). MCMC provides more appropriate estimates of uncertainty for models that consist of a mix of common and group-specific parameters.
stan_nlmer

Similar to nlmer (lme4 package) package for nonlinear "mixed-effects" models, but flexible priors can be specified for all parameters in the model, including the unknown covariance matrices for the varying (group-specific) coefficients.
stan_gamm4

Similar to gamm4 (gamm4 package), which augments a GLM (possibly with group-specific terms) with nonlinear smooth functions of the predictors to form a Generalized Additive Mixed Model (GAMM). Rather than calling lme4::glmer like gamm4 does, stan_gamm4 essentially calls stan_glmer, which avoids the optimization issues that often crop up with GAMMs and provides better estimates for the uncertainty of the parameter estimates.
stan_polr

Similar to polr (MASS package) in that it models an ordinal response, but the Bayesian model also implies a prior distribution on the unknown cutpoints. Can also be used to model binary outcomes, possibly while estimating an unknown exponent governing the probability of success.
stan_betareg

Similar to betareg (betareg package) in that it models an outcome that is a rate (proportion) but, rather than performing maximum likelihood estimation, full Bayesian estimation is performed by default, with customizable prior distributions for all parameters.
stan_clogit

Similar to clogit (survival package) in that it models an binary outcome where the number of successes and failures is fixed within each stratum by the research design. There are some minor syntactical differences relative to survival::clogit that allow stan_clogit to accept group-specific terms as in stan_glmer.
stan_mvmer

A multivariate form of stan_glmer, whereby the user can specify one or more submodels each consisting of a GLM with group-specific terms. If more than one submodel is specified (i.e. there is more than one outcome variable) then a dependence is induced by assuming that the group-specific terms for each grouping factor are correlated across submodels.
stan_jm

Estimates shared parameter joint models for longitudinal and time-to-event (i.e. survival) data. The joint model can be univariate (i.e. one longitudinal outcome) or multivariate (i.e. more than one longitudinal outcome). A variety of parameterisations are available for linking the longitudinal and event processes (i.e. a variety of association structures).

Estimation algorithms

The modeling functions in the rstanarm package take an algorithm argument that can be one of the following:

Sampling (algorithm="sampling"):

Uses Markov Chain Monte Carlo (MCMC) --- in particular, Stan's implementation of Hamiltonian Monte Carlo (HMC) with a tuned but diagonal mass matrix --- to draw from the posterior distribution of the parameters. This is the slowest but most reliable of the available estimation algorithms and it is the default and recommended algorithm for statistical inference.

Mean-field (algorithm="meanfield"):

Uses mean-field variational inference to draw from an approximation to the posterior distribution. In particular, this algorithm finds the set of independent normal distributions in the unconstrained space that --- when transformed into the constrained space --- most closely approximate the posterior distribution. Then it draws repeatedly from these independent normal distributions and transforms them into the constrained space. The entire process is much faster than HMC and yields independent draws but is not recommended for final statistical inference. It can be useful to narrow the set of candidate models in large problems, particularly when specifying QR=TRUE in stan_glm, stan_glmer, and stan_gamm4, but is only an approximation to the posterior distribution.

Full-rank (algorithm="fullrank"):

Uses full-rank variational inference to draw from an approximation to the posterior distribution by finding the multivariate normal distribution in the unconstrained space that --- when transformed into the constrained space --- most closely approximates the posterior distribution. Then it draws repeatedly from this multivariate normal distribution and transforms the draws into the constrained space. This process is slower than meanfield variational inference but is faster than HMC. Although still an approximation to the posterior distribution and thus not recommended for final statistical inference, the approximation is more realistic than that of mean-field variational inference because the parameters are not assumed to be independent in the unconstrained space. Nevertheless, fullrank variational inference is a more difficult optimization problem and the algorithm is more prone to non-convergence or convergence to a local optimum.

Optimizing (algorithm="optimizing"):

Finds the posterior mode using a C++ implementation of the LBGFS algorithm. If there is no prior information, then this is equivalent to maximum likelihood, in which case there is no great reason to use the functions in the rstanarm package over the emulated functions in other packages. However, if priors are specified, then the estimates are penalized maximum likelihood estimates, which may have some redeeming value. Currently, optimization is only supported for stan_glm.

Resources

mc-stan.org/rstanarm (online documentation, vignettes)
Ask a question (Stan Forums on Discourse)
Open an issue (GitHub issues for bug reports, feature requests)

Installation

Latest Release

The most recent rstanarm release can be installed from CRAN via

install.packages("rstanarm")

Development Version

To install from GitHub, first make sure that you can install the rstan package and C++ toolchain by following these instructions. Once rstan is successfully installed, you can install rstanarm from GitHub using the remotes package by executing the following in R:

# Change 2 to however many cores you can/want to use to parallelize install
# If you experience crashes or run out RAM during installation, try changing this to 1
Sys.setenv(MAKEFLAGS = "-j2")
Sys.setenv("R_REMOTES_NO_ERRORS_FROM_WARNINGS" = "true")
remotes::install_github("stan-dev/rstanarm", INSTALL_opts = "--no-multiarch", force = TRUE)

You can switch build_vignettes to TRUE but it takes a lot longer to install and the vignettes are already separately available from the Stan website and CRAN. If installation fails, please let us know by filing an issue.

Survival Analysis Version

The feature/survival branch on GitHub contains a development version of rstanarm that includes survival analysis functionality (via the stan_surv modelling function). Until this functionality is available in the CRAN release of rstanarm, users who wish to use the survival analysis functionality can install a binary version of the survival branch of rstanarm from the Stan R packages repository with:

install.packages("rstanarm", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))

Note that this binary is static (i.e. it is not automatically updated) and is only hosted so that users can access the (experimental) survival analysis functionality without needing to go through the time consuming (and sometimes painful) task of installing the development version of rstanarm from source.

Contributing

If you are interested in contributing to the development of rstanarm please see the developer notes page.

rstanarm's People

Contributors

Stargazers

Watchers

Forkers

fandres70 mhandreae jburos pgnepal paulhendricks tomwallis sambrilleman lixiangchun detcitty singmann harleyjean tjmahr joshnr13 bluelou fsolt anhnguyendepocen eunbeekim datalulu paasim mansmeg nesanders radovankavicky gapdata dpsimpson jjtwwdei andylin512 cfitzge1 nemochina2008 phoenixstar7 mejihero bachlaw fartist wjakethompson lauken13 julbautista malcolmbarrett mcdonohue mcol rborchering statwonk alexadibenedetto alfcrisci devincaughey csetraynor dirmeier ermeel86 lcomm gakkilovemath sremmers nyiuab mattforshaw decision-sciences snaketron stjordanis jyuu elzinga kaz-yos abikoushi davors pamelanluna lilymedina liuxiangyu127 karthy257 flynngo mriganav061 economer danschrage alexanderbor hrossman maj-biostat alexandre-caron fanzheng0352 nohemihuanca proof-by-accident nspindel ssp3nc3r sfarhd14 racarana enbrown galickda wlandau fbartos eddygassama joshualiuxu stevebronder tinnguyen96 ehoumanevans samueldonnelly arrendi quinfer jbarsotti allisterh hiroshimamiya yawomkobara lycium0605 pio82 r-study-cxb padpadpadpad brendonchau jacobenglert

rstanarm's Issues

add support for ordinal and multinomial logit models

Before First Release

These tasks (and probably other things I'm forgetting) still need to be done before first release:

Change periods to underscores
Roxygen templates for function args (b6c09dc)
Sparse matrix stuff
Copy glmer stuff from continuous.stan to most of the others
Copy horseshoe stuff from continuous.stan to others (and make needed changes on R side)
Negative binomial log-likelihood function
Add a help page for the plotting functions (839b03f provides links to the RStan help files. For now the links will only work if the ggplot RStan branch is installed.)
Check which other functions need tests

errors after rstan update

@bgoodri

With rstan 2.9 I'm getting the following error when running

stan_glm(mpg ~ wt, data = mtcars)

sampling seems to proceeds ok, bun then

Error in dim(sssf) <- c(n2, object@sim$chains, length(tidx)) : 
dims [product 20000] do not match the length of object [40000]

which looks like it's coming from extract as defined in the stanfit-class.R file in rstan.

Separation of tensors and interactions in GAMM models

The gamm4 models use t2() in place of te() for tensor splines, but have to equivalent for ti(). This means one can't separate the interactions from the main effects, as in this example: http://stats.stackexchange.com/a/189437/28370

It would be great if stan_gamm4() could support ti() terms or something similar.

Related to #46

Wrong fixed effects estimates when running stan_glm without an intercept

I get weird fixed effects estimates when fitting a model without an intercept:

library(rstanarm)
devtools::install_github("paul-buerkner/brms")
library(brms)
foo1 <- stan_glm(mpg ~ -1 + ., data=mtcars, chains=2, iter=2000, warmup=500)
foo1

foo2 <- brm(mpg ~ -1 + ., data=mtcars, chains=2, iter=2000, warmup=500)
foo2

foo3 <- glm(mpg ~ -1 + ., data = mtcars)
summary(foo3)

The results of foo2 and foo3 closely match each other while foo1 produces very different results. Especially the MD_SD (and SD) estimates are a bit off. This may be a result of centering the fixed effects design matrix X regardless of whether an intercept is present or not. I believe that centering is invalid in models without an intercept (at least when not corrected otherwise) but I may be mistaken.

compiled C++ code for the model object is (sometimes) invalid

I'm working with some simulated data and started by running a few simple models. I'm assuming the generic Gaussian GLM is pre-compiled in rstanarm -- I'd installed 2.8.1 directly. This could be an idiosyncrasy of working off clone VMs, but I got the following error:

> rstan1 <- stan_glmer(qty ~ price + (1 + price | prod_key)
+                      , data = df
+                      , chains = 4
+                      , iter = 1000
+                      , prior = normal(0, 2)
+                      , prior_intercept = normal(100, 40)
+                      )
starting worker pid=24456 on localhost:11133 at 14:04:11.818
starting worker pid=24465 on localhost:11133 at 14:04:12.164
starting worker pid=24474 on localhost:11133 at 14:04:12.491
starting worker pid=24483 on localhost:11133 at 14:04:12.796
Error in checkForRemoteErrors(val) : 
  4 nodes produced errors; first error: the compiled object from C++ code for this model is invalid, possible reasons:
  - compiled with save_dso=FALSE;
  - compiled on a different platform;
  - not existed (created from reading csv files).
> stan_trace(rstan1)
Error in if (object$algorithm == "optimizing") stop("Plots not yet available for optimization (algorithm='optimizing')",  : 
  argument is of length zero
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.0 (Maipo)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rstan_2.8.1     ggplot2_1.0.1   shinystan_2.0.1 shiny_0.12.2    rstanarm_2.8.1  Rcpp_0.12.1     arm_1.8-6       lme4_1.1-10     Matrix_1.2-1    MASS_7.3-40    
[11] lazyeval_0.1.10 dplyr_0.4.1    

loaded via a namespace (and not attached):
 [1] lattice_0.20-31     zoo_1.7-12          gtools_3.5.0        assertthat_0.1      digest_0.6.8        mime_0.4            R6_2.1.1            plyr_1.8.1         
 [9] stats4_3.2.1        coda_0.18-1         httr_0.6.1          minqa_1.2.4         nloptr_1.0.4        DT_0.1              shinythemes_1.0.1   proto_0.3-10       
[17] devtools_1.9.1.9000 splines_3.2.1       shinyjs_0.2.3       RcppEigen_0.3.2.5.1 stringr_0.6.2       htmlwidgets_0.5     loo_0.1.3.9000      RCurl_1.95-4.5     
[25] munsell_0.4.2       httpuv_1.3.3        base64enc_0.1-3     htmltools_0.2.6     gridExtra_2.0.0     threejs_0.2.1       codetools_0.2-11    matrixStats_0.15.0 
[33] withr_1.0.0         bitops_1.0-6        BH_1.58.0-1         grid_3.2.1          nlme_3.1-120        xtable_1.8-0        gtable_0.1.2        DBI_0.3.1          
[41] magrittr_1.5        StanHeaders_2.8.0   scales_0.2.4        reshape2_1.4.1      dygraphs_0.5        xts_0.9-7           tools_3.2.1         markdown_0.7.7     
[49] abind_1.4-3         parallel_3.2.1      inline_0.3.14       colorspace_1.2-4    memoise_0.2.1       knitr_1.11

The traceback() output was unusual -- mostly comprised of a long list of numbers, presumably sampling args. It's interesting that if I try this model on a single chain (eschewing the parallel calls), it suddenly works. Because this error only sometimes happens, it's been tricky to pinpoint and diagnose. I don't recall it happening prior to 2.8.0, but this may just be one of the drawbacks of working with the bleeding edge versions...

allow binomial #trials to be specified using weights argument

posterior prediction following stan_gamm4

Currently, there is just a stub that throws an error saying it is unimplemented (unless you are only doing in-sample prediction). The predict.gam() function in the mgcv package has the complicated stuff for handling the smooth terms when newdata is specified and if type = "lpmatrix", it returns the necessary matrix of predictors rather than the predictions. But we need to dig into the details of it to make sure we are multiplying the correct columns by the correct margins of the posterior.

error when getting prior.info if subset argument is supplied

@bgoodri

These lines

all_args <- mget(names(formals()), sys.frame(sys.nframe()))
prior.info <- all_args[grep("prior", names(all_args), fixed = TRUE)]

are intended to collect all the prior-related arguments so after the model is fit we or the user can see what choices were made for the priors. Because match.call won't include these if the user leaves the default priors I came up with the solution above. But I just discovered that something like

stan_glm(mpg ~ wt, data = mtcars, subset = cyl < 8)

leads to the following error:

Error in mget(names(formals()), sys.frame(sys.nframe())) : 
  object 'cyl' not found

which I guess makes sense, but I'm not sure how to work around it. Any ideas?

Gaussian processes

Not a short-term priority, but could be nice to have some (limited) functionality for GPs in future iterations of the package.

non-MCMC options

Each stan_foo() function should have an option to do optimizing rather than sampling (and probably an option for variational inference too). Consequently, each stan_foo() function should have an option to turn the priors off, so that the user can obtain pure MLEs. These changes can probably be accomodated by the existing .stan files if we pass a has_priors flag.

numerical test failure on Solaris x86_32

https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/rstanarm-00check.html

Should we really use confint method for Bayesian credible intervals?

@bgoodri I've been thinking recently that it might avoid a great deal of confusion if we create a separate method for obtaining so-called credible intervals rather than using R's standard confint. Perhaps an S3 generic credint and a credint.stanreg method? (Or just a credint function, skipping the S3 method stuff?) If we want to do this then I can make the change quite easily this afternoon.

Here are some reasons why I think it's worth doing this

The documentation for stats::confint, which we can't change, clearly says it's for confidence intervals.
It would give us the opportunity to document credint, where we can make it clear what the differences are between frequentist/Bayesian confidence/credible intervals. We don't really do this anywhere at the moment but I think the differences in interpretation are important.
The cost is small. The only downside I can think of is that some users will try to call confint for a model not estimated by optimization, but we can either issue a warning to use credint (and allow confint to still be usable) or an error that forces the use of credint.
If we wanted we could provide credible intervals based on posterior quantiles (as we do now) but also give the option to compute HPD intervals (that would be valid assuming there isn't severe multimodality).

printing the result of a stan_glm is broken

test <- stan_glm(mpg ~ wt, data = mtcars)
> test

Call:  stan_glm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.226       -5.322  

Degrees of Freedom: Total (i.e. Null);  30 Residual
Error in signif(x$null.deviance, digits) : 
  non-numeric argument to mathematical function

> traceback()
4: format(signif(x$null.deviance, digits))
3: cat("Null Deviance:\t   ", format(signif(x$null.deviance, digits)), 
       "\nResidual Deviance:", format(signif(x$deviance, digits)), 
       "\tAIC:", format(signif(x$aic, digits)))
2: print.glm(x)
1: function (x, ...) 
   UseMethod("print")(x)

This seems to have to do with the fact that it is calling print.glm now.

ignoring priors?

From @waynefoltaERI (via brms issue 24)

For example, while playing with the mtcars dataset for this issue, I found that brms' and rstanarm's answers differed considerably. And glm agreed with rstanarm... which gave me a clue and when I investigated further, it appears that rstanarm is ignoring priors. (Or I am misunderstanding how they are specified, but not getting any error messages.) So the reason for the agreement is that I was specifying priors, but rstanarm was ignoring them and using flat (improper, frequentist-like) priors. Maybe has to do with their pre-compilation.

gaussian.stan can't handle stan_glmer models with no X matrix

For example, this model from the lme4 documentation:

stan_lmer(diameter ~ (1|plate) + (1|sample), data = Penicillin)

will result in an empty X matrix.

add support for binomial models

Currently Bernoulli is supported but not binomial

QR decompositions in rstanarm models

All the .stan programs in the rstanarm matrix utilize a design matrix X that is N by K. It would be interesting to explore an option to preprocess X with a QR decomposition

X = Q * R

where Q' * Q = I and R is upper triangular. Then if the linear predictor is

eta = X * beta = (Q * R) * beta = Q * (R * beta) = Q * theta

where

theta = R * beta

it is perhaps reasonable to put independent symmetric priors on the elements of theta because the columns of Q are mutually orthogonal and all have unit length. If so, you could recover the posterior distribution of beta as a generated quantity with the inverse of the fixed R matrix via

beta = R_inverse * theta

This idea may also help with the scaling problems we have seen with meanfield ADVI when doing regression problems.

warnings about VarCorr

Email from Martin Maechler:

Yesterday, a new version of 'nlme' was released on CRAN,
for which the VarCorr() S3 generic has a new '...' argument
--- so that others can write methods with extra arguments,
something really desirable.

However, this lead to the following WARNINGS for your packages,
which we'd ask you to fix.

I strongly recommend you to do what I (as co-author) did for the
lme4 package:

in your DESCRIPTION, depend on the very latest nlme,
i.e., put
nlme (>= 3.1-123)
in your 'Imports:' or 'Depends:' in your DESCRIPTION file
add ', ...' at the end of the argument list of your VarCorr.
function definition,
both in R/.R and man/.Rd

lme4/lme4@61de241

Fix errors caused by ggplot2 update

https://github.com/hadley/ggplot2/blob/master/NEWS.md

make "tidy" method for stanreg objects and submit PR to broom package

Should be pretty easy, just need to do it.

memory size error with large random effects models in rstanarm

Executive summary:
While rstanarm fits a random effects logistic regression model to my large dataset (~50,000 units of observations) and runs four chains without an issue, on trying to generate the stanfit object, I get a memory size error. There are about 2000 random effects in this model. The issue arises even with only 10 iterations. I have 8 RAM and four cores on a virtual windows server.

formulaR2.0 <- ond ~ pay +age_group +sex + (1 | cpt)

stanfit2.0 <- stan_glmer(formulaR2.0, data = myAQI, family = binomial, iter= 10, cores =2)
Loading required namespace: rstudioapi

Error: cannot allocate vector of size 2.7 Gb In addition:

Warning messages:
1: In structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr") :
Reached total allocation of 8191Mb: see help(memory.size)
2: In structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr") :
Reached total allocation of 8191Mb: see help(memory.size)
3: In structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr") :
Reached total allocation of 8191Mb: see help(memory.size)
4: In structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr") :
Reached total allocation of 8191Mb: see help(memory.size)

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] rstanarm_2.8.0 rstan_2.8.0 ggplot2_1.0.1 Rcpp_0.12.1 lme4_1.1-9
[6] Matrix_1.2-2 knitr_1.11

loaded via a namespace (and not attached):
[1] nloptr_1.0.4 plyr_1.8.3 shinyjs_0.2.0 xts_0.9-7
[5] base64enc_0.1-3 tools_3.2.2 digest_0.6.8 nlme_3.1-121
[9] gtable_0.1.2 lattice_0.20-33 rstudioapi_0.3.1 shiny_0.12.2
[13] shinystan_2.0.1 proto_0.3-10 loo_0.1.3 gridExtra_2.0.0
[17] stringr_1.0.0 gtools_3.5.0 dygraphs_0.4.5 htmlwidgets_0.5
[21] DT_0.1 stats4_3.2.2 grid_3.2.2 inline_0.3.14
[25] R6_2.1.1 minqa_1.2.4 reshape2_1.4.1 magrittr_1.5
[29] codetools_0.2-14 shinythemes_1.0.1 threejs_0.2.1 scales_0.3.0
[33] matrixStats_0.14.2 htmltools_0.2.6 MASS_7.3-43 splines_3.2.2
[37] xtable_1.7-4 mime_0.4 colorspace_1.2-6 httpuv_1.3.3
[41] stringi_0.5-5 munsell_0.4.2 markdown_0.7.7 zoo_1.7-12

add predict method

stan_glmer( ) mistranslates varying slope, varying intercept model

I ran into an error earlier today which led to NaNs in the log-likelihood matrix. Upon further inspection, I noticed that sigma had been estimated twice, and its mean took a negative value. Without dissecting the functions, I can only speculate that this is some kind of variable indexing bug in the "translation" between lmer syntax and stan code.

In this reproducible example, the first sigma appears to represent the group-level intercept, b[(Intercept) 1] seems to be its respective slope, and every subsequent variable name is off by one, so to speak. An estimate for the real sigma is not given. Otherwise, stan does a fine job of recovering the data-generating parameters.

Inference for Stan model: Gaussian GLM.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

                  mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
(Intercept)       1.92    0.04 0.71  0.47  1.54  1.91  2.29  3.45   338 1.01
x                 3.95    0.04 0.83  2.31  3.48  3.95  4.44  5.54   373 1.01
sigma            -0.96    0.04 0.72 -2.43 -1.34 -0.95 -0.60  0.52   341 1.01
b[(Intercept) 1] -1.97    0.04 0.85 -3.67 -2.43 -1.96 -1.46 -0.33   442 1.01
b[x 1]           -0.10    0.04 0.72 -1.66 -0.45 -0.09  0.29  1.40   359 1.01
b[(Intercept) 2]  0.22    0.04 0.84 -1.42 -0.26  0.20  0.71  1.90   441 1.01
b[x 2]            1.12    0.04 0.72 -0.39  0.75  1.12  1.50  2.62   328 1.01
b[(Intercept) 3]  1.99    0.04 0.86  0.33  1.50  1.97  2.50  3.67   372 1.01
sigma            -0.96    0.04 0.72 -2.43 -1.34 -0.95 -0.60  0.52   341 1.01
mean_PPD          3.98    0.00 0.06  3.87  3.95  3.98  4.02  4.10  1771 1.00

Samples were drawn using NUTS(diag_e) at Mon Aug 17 16:21:17 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

R code:

N <- 600
J <- 3
x <- runif(N)
group <- c(rep(1, N/J), rep(2, N/J), rep(3, N/J))
y <- group + 2.00 * group * x + rnorm(N)
simdat <- data.frame(cbind(y, x, group))

stan_1 <- stan_glmer(y ~ x + (1+x|group)
                     , data = simdat
                     , chains = 4
                     , iter = 1000
                     , family = gaussian()
                     , prior.for.intercept = normal(0,2)
                     , prior = normal(0,3)
                     )
stan_1
ll <- stan_1$log_lik
crossv <- loo::loo(ll, cores = 1) # NaNs not allowed error
#table(round(crossv$pareto_k)) # Inf?

rstanarm v. 2.7.0

allow using student_t instead of gaussian for likelihood

For stan_glm with a suitable outcome (and maybe stan_lmer), we should provide an option to use the t distribution as the likelihood instead of a Gaussian.

Perhaps we just follow BDA3 17.2 and use the fact that

y_i ~ t_v(m, s^2)

is equivalent to

y_i | V_i ~ N(m, V_i)
V_i ~ scaled-inv-chi-sq(v, s^2)

'meanfield' results don't match 'sampling' results

Seeing as it's the intercept that's so different, does this have to do with the centering?

fit <- stan_glm(mpg ~ wt, data = mtcars)
fit_vb <- stan_glm(mpg ~ wt, data = mtcars, algorithm = "meanfield")

> summary(fit)
                mean mcse   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
(Intercept)    37.25 0.04 1.94  33.31  35.96  37.22  38.55  41.09  2477    1
wt             -5.33 0.01 0.58  -6.45  -5.72  -5.33  -4.94  -4.22  2477    1
sigma           3.15 0.01 0.43   2.45   2.85   3.10   3.39   4.12  1813    1
mean_PPD       20.10 0.02 0.79  18.59  19.58  20.10  20.62  21.67  2636    1
log-posterior -53.10 0.04 1.32 -56.61 -53.65 -52.76 -52.14 -51.62   938    1


> summary(fit_vb)
               mean   sd  2.5%   25%   50%   75% 97.5%
(Intercept)    0.62 0.07  0.50  0.58  0.62  0.67  0.77
wt            -5.37 0.63 -6.65 -5.77 -5.35 -4.96 -4.16
sigma          3.12 0.34  2.52  2.88  3.10  3.35  3.87
mean_PPD      37.46 2.13 33.29 36.02 37.38 38.92 41.74
log-posterior 20.16 0.88 18.52 19.54 20.19 20.75 21.84

Release checklist

(Edit this comment to add additional checklist items or mark any as complete)

Code

Documentation

Roxygen templates for function args
Add a help page for the plotting functions
Remove mentions of advi from doc
Add references to BDA3 and other sources

Vignettes

package overview
lm
aov
polr
binomial/bernoulli glm
count glm (poisson, neg-binomial)
continuous glm (gaussian, gamma, inverse-gaussian)
glmer (point to the bernoulli and count vignettes and the lme4 vignettes, and discuss prior on covariance matrices) and gamm4
Check size of vignettes (should be fine now with svglite)

Demos

stan_glmer stuff from the second half of the ARM book

have option to draw from the prior of any model

We should just be able to pass N = 0 and skip over the likelihood contribution to the posterior, but it may take a bit of fiddling. People like to see how much the data contributes to the posterior distribution relative to the prior.

stanreg-methods.R not parsing

Some weirdness in the latest edit around lines 424-432.

commit b2c73f0

Check that y is the same for model comparison (loo or waic)

The compare function in the loo package checks that models have the same number of observations, but we can also check that the outcome variable is the same. This would probably require a wrapper around loo::compare that first does this check, which means rstanarm would have its own model comparison function, instead of users just using the loo package. But I'm not sure that doing this check is worth adding a function since compare's check for the number of observations will catch almost all cases of y being different.

posterior_predict fails for stan_glmer if transformations used in formula

@Felipe-Osorio is the error thrown when running the code below the one you were referring to?

fit <- stan_glmer(mpg ~ log1p(wt) + (1|cyl), data = mtcars)
p <- posterior_predict(fit) # ok
p2 <- posterior_predict(fit, newdata = mtcars[1:10, ])
Error in `[.data.frame`(newdata, , colnames(frX)) : 
  undefined columns selected

possibly emulate the gamm4 package / function

The gamm4 R package inputs syntax for estimating generalized additive models, converts it internally to lme4 syntax, and uses lmer to estimate the parameters. It sounds as if (from the appendix of the lme4 vignette and the documentation of gamm4) that we could just copy some of this R code and make a stan_gamm4() function with no additional Stan code and it is quite likely that the decov prior will make the estimates of the smoothing parameters more reliable.

Posterior predict (.pp_data_mer) error with newdata using tbl_df

I noticed the following error when calling posterior_predict with newdata, where my newdata contains a tbl_df (which inherits from data.frame) instead of a data.frame.

Here is a reproducible example:

library(dplyr)
data(mtcars)
mtcars_tbl_df <- tbl_df(mtcars)

m1 <- stan_glmer(qsec ~ 1 + wt + (1 | cyl), data = mtcars)
ppred1_df <- posterior_predict(m1, newdata = mtcars)

m2 <- stan_glmer(qsec ~ 1 + wt + (1 | cyl), data = mtcars_tbl_df)
ppred2_tbl <- posterior_predict(m2, newdata = mtcars_tbl_df)

Which produces the following error:

> ppred2_tbl <- posterior_predict(m2, newdata = mtcars_tbl_df)
Error in .pp_data_mer(object, newdata, ...) : 
  New levels found in grouping variable(s) cyl

However, in this example, the following will work:

ppred2_tbl2 <- posterior_predict(m2, newdata = as.data.frame(mtcars_tbl_df))

This seems like a minor bug to me, since there is such an easy work-around (demonstrated above). Documenting in case it is helpful.

return "clean" stan code to user

Discussion here: https://groups.google.com/forum/?fromgroups#!topic/stan-dev/5pJdH72hoM8

glFormula error

@bgoodri I just triggered an error I didn't know about in glFormula. Apparently it won't allow a model with more random effects than observations because "the random-effects parameters are probably unidentifiable." That's a reasonable concern for glmer, but for stan_glmer this will rule out many models (like a lot of the ones Andrew fits) for which it's possible to overcome this with informative priors.

stan_glmer messed up after recent merge

@bgoodri I'm not sure what's going but after your recent merge for the posterior_predict stuff this example

stan_lmer(Reaction ~ Days + (Days | Subject), data = sleep study)

throws an error. It has something to do with the _NEW_ terms. We're getting b[(Intercept) Subject:_NEW_Subject] but missing the _NEW_ term for Days.

stan_plot cannot exclude "log-posterior"

stan_plot(*, pars = "log-posterior", include = FALSE) does not actually exclude the log-posterior from the plot
stan_plot(*, pars = "lp__", include = FALSE) does, so I assume it is just a naming thing?

add IRT model(s)

I think we are obligated to work IRT for one of Stan's grants so we might as well make it precompileable

add support for multinomial logit models

Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal

I am running the following type of model:

stan_glmer(log_y ~ (1 + log_x | group), 
                   prior = t7_prior, prior_intercept = t7_prior,
                   data = data, iter = 1000, chains = 4)

The dataset had about 20,000 rows with 99 groups.

...
Chain 3, Iteration: 1000 / 1000 [100%]  (Sampling)
#  Elapsed Time: 14649.7 seconds (Warm-up)
#                19422.5 seconds (Sampling)
#                34072.2 seconds (Total)


Warning message:
Warning message:
passing unknown arguments: show_messages. 
passing unknown arguments: show_messages. 
Warning message:
passing unknown arguments: show_messages. 
Warning message:
passing unknown arguments: show_messages. 

Chain 1, Iteration: 1000 / 1000 [100%]  (Sampling)
#  Elapsed Time: 17411.9 seconds (Warm-up)
#                18800 seconds (Sampling)
#                36211.9 seconds (Total)

Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal
Calls: <Anonymous> ... doTryCatch -> sendData -> sendData.SOCK0node -> serialize
In addition: Warning message:
passing unknown arguments: show_messages. 
Execution halted

Session Info:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.0 (Maipo)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8          
 [4] LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8       
[10] LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] markdown_0.7.7         shiny_0.12.2           rstanarm_2.8.0         lme4_1.1-10           
 [5] magrittr_1.5           dplyr_0.4.1            rstan_2.8.0            Rcpp_0.12.1           
 [9] Matrix_1.2-1           PricingModel_0.2.1.8-6 netezza_0.6-3.9        rJava_0.9-6           
[13] ggplot2_1.0.1         

loaded via a namespace (and not attached):
 [1] lattice_0.20-31    tidyr_0.2.0        zoo_1.7-12         gtools_3.5.0       assertthat_0.1    
 [6] digest_0.6.8       mime_0.4           R6_2.1.1           plyr_1.8.3         stats4_3.2.1      
[11] coda_0.17-1        lazyeval_0.1.10    minqa_1.2.4        nloptr_1.0.4       R.oo_1.19.0       
[16] DT_0.1             shinythemes_1.0.1  proto_0.3-10       splines_3.2.1      shinyjs_0.2.0     
[21] stringr_1.0.0      htmlwidgets_0.5    loo_0.1.3          munsell_0.4.2      httpuv_1.3.3      
[26] base64enc_0.1-3    htmltools_0.2.6    gridExtra_2.0.0    threejs_0.2.1      matrixStats_0.14.2
[31] MASS_7.3-40        R.methodsS3_1.7.0  grid_3.2.1         nlme_3.1-120       arm_1.8-6         
[36] xtable_1.7-4       gtable_0.1.2       DBI_0.3.1          RJDBC_0.2-5        scales_0.3.0      
[41] rredis_1.6.9       stringi_0.5-5      reshape2_1.4.1     dygraphs_0.4.5     xts_0.9-7         
[46] tools_3.2.1        shinystan_2.0.1    abind_1.4-3        parallel_3.2.1     inline_0.3.14     
[51] colorspace_1.2-6

Sorry, I do not have traceback.

Missing images in cran vignettes

See, for example,
https://cran.rstudio.com/web/packages/rstanarm/vignettes/binomial.html
I can't see any images (just the broken image link symbols)

I just noticed that this was mentioned in the release announcement, but I'll leave this issue here for those who don't read release notes...

make sure all tests pass a seed argument

We now have a non-reproducible failure on r-devel for Windows
https://www.r-project.org/nosvn/R.check/r-devel-windows-ix86+x86_64/rstanarm-00check.html

variable selection by projective prediction

Let's do this for rstanarm

https://groups.google.com/d/msg/stan-users/5IxMxY3tv2w/ptm2vyJmBwAJ

see the repo https://github.com/jtpi/rstan-varsel

default to QR=TRUE for advi?

@bgoodri I've also noticed that doing the QR composition before advi seems to be more reliable, the same you discussed with @akucukelbir in this thread on stan-dev. Do you think we should default to QR=TRUE if the user specifies "meanfield" or "fullrank" as the estimation algorithm?

df.residual

@bgoodri Is there any reason why we should definitely report residual degrees of freedom? Right now we have a call to qr in stanreg that is only used to get the rank of x for computing df.residual. If x is really large we probably don't want to compute the QR decomposition in stanreg unless we really want to report df.residual. What do you think?

allow sigma to vary by group

For models with a Gaussian outcome and some form of grouping structure, allow sigma to vary by level of a group.

Conflict in R2 function with package 'caret'

I realize that you can't avoid all possible conflicts, but just wanted to let you know that R package caret also defines a function R2, which causes really puzzling errors with stan_lm. I don't have a good alternative (you might want to rename hs, etc, as well).

Truncated priors in (g)lm(er) models?

Are there plans to implement the possibility of including truncated priors (for beta coefficients, or even intercept) in (g)lm(er) models, through beta or uniform distributions?

demos

With either devtools::source_url or just source in the latest version of R, it is fairly easy to do many of the example-models via rstanarm with the following code idiom

{
source("https://raw.githubusercontent.com/stan-dev/example-models/*.data.R", local = TRUE)
stan_foo(formula, data = environment(), ...)
}

However, since this requires an internet connection, it cannot be part of R CMD check but I think it would be good to make many demos that do this using best practices (from https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Package-subdirectories ):

The demo subdirectory is for R scripts (for running via demo()) that demonstrate some of the functionality of the package. Demos may be interactive and are not checked automatically, so if testing is desired use code in the tests directory to achieve this. The script files must start with a (lower or upper case) letter and have one of the extensions .R or .r. If present, the demo subdirectory should also have a 00Index file with one line for each demo, giving its name and a description separated by a tab or at least three spaces. (This index file is not generated automatically.)

roxygen2 version 5.0.0 breaks rstanarm build

The error is

Quitting from lines 107-113 (aov.Rmd) 
Error: processing vignette 'aov.Rmd' failed with diagnostics:
'stan_lm' is not an exported object from 'namespace:rstanarm'
Execution halted

which appears to be consistent with the new auto-generated NAMESPACE, which may have to do with these warnings

roxygen2::roxygenize(clean = TRUE)
...
There were 14 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: @export [posterior_predict.R#12]: may only span a single line
2: @export [predict.R#4]: may only span a single line
3: @export [rstanarm-package.R#14]: may only span a single line
4: @export [stan_glm.R#23]: may only span a single line
5: @export [stan_glm.fit.R#26]: may only span a single line
6: @export [stan_glmer.R#7]: may only span a single line
7: @export [stan_glmer.R#113]: may only span a single line
8: @export [stan_lm.R#25]: may only span a single line
9: @export [stan_polr.R#46]: may only span a single line
10: @export [stanreg-methods.R#99]: may only span a single line
11: @export [stanreg-methods.R#227]: may only span a single line
12: @export [stanreg-methods.R#239]: may only span a single line
13: @export [stanreg-methods.R#249]: may only span a single line
14: @export [summary.R#8]: may only span a single line