Comments (13)
I'd like to see efforts to make it as easy as possible for researchers to use the model/data repo and report results. That means adding:
- Evaluation scripts that run over the whole repo and report results. Hopefully *nix shell scripts will work across Windows/Mac/Linux. I believe that Windows has added *nix-style scripting; otherwise maybe require Cygwin to be installed.
- Documentation that makes this straightforward.
from example-models.
Link to SBC paper: http://www.stat.columbia.edu/~gelman/research/unpublished/sbc.pdf
Re: input files: Rather than hardcoding synthetic inputs, I think it would be better to deliver a reproducible generator program (i.e., one that gives the same results when run with the same random seed). Advantages of a program:
- It's smaller, and therefore easier to download, store on disk, etc.
- It's self-documenting, making it perfectly clear where the data set came from (sampled from the prior, or with some modifications, or intentionally mis-specified, etc).
- A program can be parameterized, making it easy to do research on data sets that differ only in size, or number of (synthetic) regressors, or etc.
- If the repository needs to evolve, version differences in generator programs are much more interpretable than version differences in generated data sets.
We can also ship a canonical small output from the generator program, in case someone wants to verify that they aren't experiencing version skew.
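A minimal sketch of what such a generator could look like, in Python. The model here (a small linear regression sampled from its prior) and all names and prior choices are illustrative assumptions, not the benchmark's actual contents:

```python
import json
import numpy as np

def generate_regression_data(seed, n=100, k=3):
    """Sketch of a reproducible generator: draw predictors, then simulate
    outcomes from the prior of a toy linear regression. Parameterized by
    seed, data size n, and number of predictors k."""
    rng = np.random.default_rng(seed)   # fixed seed => identical data set
    x = rng.normal(size=(n, k))         # unmodeled predictors
    alpha = rng.normal(0.0, 1.0)        # parameters drawn from the prior
    beta = rng.normal(0.0, 1.0, size=k)
    sigma = abs(rng.normal(0.0, 1.0))
    y = rng.normal(alpha + x @ beta, sigma)
    return {"N": n, "K": k, "x": x.tolist(), "y": y.tolist()}

if __name__ == "__main__":
    data = generate_regression_data(seed=2023)
    # Same seed, same data: this is what makes the generator shippable
    # in place of the data itself.
    assert data == generate_regression_data(seed=2023)
    print(json.dumps(data)[:60])
```

The parameterization by `n` and `k` is what enables the "data sets that differ only in size" research use mentioned above.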
Re: reproducibility: How feasible would it be (maybe some ways down the line) to provide a Kubernetes configuration (with associated Docker files, etc) that stands up some cloud machines to rerun Stan on all the models, and possibly to run the researcher's method as well? It would be the simplest possible interface: put $ here, and get everything recomputed. Including, for instance, allowing researchers to re-assess Stan on different versions of the models or data sets, etc, just by locally modifying the benchmark repository.
Just an idea to think about; I don't mean to create lots of extra work.
This is not a stan-dev/stan issue. Could you please move this to either (a) example-models, (b) stat_comp_benchmarks, or, if it's preliminary, (c) design-docs? Or to Discourse for discussion until there's a plan for implementation.
Thanks for writing this up, Yuling. Until the issue gets moved, let me reply:
These 270 models represent a reasonable share of the model space
By what measure? I think we want to look at code coverage of our special functions and for model types. Are there any ODE models in there? Anything with more than a tiny amount of data?
Not all models are runnable for stan sampling.
I think we need to carefully distinguish programs and models. The model is a mathematical concept defining a joint density. The program is a Stan program. For a given model, some programs might work and others might not. For example, a centered parameterization might not work, but a non-centered parameterization might be able to sample with no problem. Now we could argue they're different models because they sample different parameters, so it's not exactly clear to me where to draw the line. For instance, the centered parameterization draws from p(alpha, sigma) and the non-centered from p(alpha_std, sigma). These are different joint densities, related through the transform alpha_std = alpha / sigma. Are these different models?
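To make the relationship concrete, here is a toy check (Python, with alpha ~ normal(0, sigma) standing in for the hierarchical case) that the two parameterizations define the same density over alpha once the Jacobian of alpha = alpha_std * sigma is accounted for; the function names are mine:

```python
import math

def normal_logpdf(x, mu, sd):
    """Log density of normal(mu, sd) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) \
        - 0.5 * ((x - mu) / sd) ** 2

# Centered:     alpha ~ normal(0, sigma)
# Non-centered: alpha_std ~ normal(0, 1), alpha = alpha_std * sigma
# Change of variables: p_centered(alpha) = p_std(alpha / sigma) / sigma
sigma = 2.5
for alpha in (-1.0, 0.0, 0.7, 3.2):
    lhs = normal_logpdf(alpha, 0.0, sigma)
    rhs = normal_logpdf(alpha / sigma, 0.0, 1.0) - math.log(sigma)
    assert abs(lhs - rhs) < 1e-12
```

So the two programs induce the same distribution over alpha; what differs is the geometry the sampler actually explores.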
Output the likelihood in the generated quantities block. We can then run cross-validation by calling LOO
Using an approximation seems risky as a gold-standard evaluation. How reliable is LOO?
Run Stan sampling on all these models, record the mean and sd of the posterior distribution for all parameters (excluding transformed parameters)
Don't we also want standard error (i.e., from ESS) on both the mean and the posterior sd? That's how @betanalpha is testing in stat_comp_benchmarks. I like the idea of testing with actual squared error from simulated parameters, as it avoids the approximation of estimating ESS.
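For reference, the arithmetic being proposed is just scaling by ESS rather than the raw draw count, plus a threshold check. A sketch (function names are mine, not taken from stat_comp_benchmarks):

```python
import math

def mcse_mean(posterior_sd, ess):
    """Monte Carlo standard error of the posterior-mean estimate:
    posterior sd scaled by the effective sample size, not the raw
    number of draws."""
    return posterior_sd / math.sqrt(ess)

def within_se(estimate, truth, se, z=3.0):
    """Flag whether an estimate falls within z standard errors of the
    known truth (z is an arbitrary illustrative threshold)."""
    return abs(estimate - truth) <= z * se

# e.g. posterior sd 1.3 estimated from 4000 draws with ESS 400:
# the uncertainty in the reported mean is 1.3 / sqrt(400) = 0.065,
# roughly three times worse than the raw draw count would suggest
# (1.3 / sqrt(4000) ~ 0.021).
```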
[@breckbaldwin] Evaluation scripts that run over all the repo and report results.
There's a script testing estimation to within standard error in stat_comp_benchmarks.
[@breckbaldwin] Documentation that makes this straightforward.
Of course. The existing scripts are simple---just input CSV for your draws and get output. But interpreting that output assumes some statistical sophistication in understanding MCMC standard error beyond what we'll be able to impart in doc for this tool.
[@axch] Rather than hardcoding synthetic inputs, I think it would be better to deliver a reproducible generator program
I agree. That was part of the original plan so that we could test scalability. The difficulty is always generating predictors for regressions (or any other unmodeled data). We also need a generator to do SBC, though that would most efficiently live in the model file itself, which complicates the issue of what the models should look like.
[@axch] How feasible would it be (maybe some ways down the line) to provide a Kubernetes configuration
I think this is critical as we're going to want to run this during algorithm development. I don't know anything about Kubernetes per se, but we're moving a lot of our testing to AWS with the help of some dev ops consultants. So @seantalts is the one to ask about that as he's managing the consultants.
re: Kubernetes - we don't have any plans for such a cluster right now and we'll run out of EC2 credits this November, but we might still be able to pursue such a thing. My first question is around isolation - are Kubernetes jobs isolated enough to be a reliable benchmarking platform?
[@axch] input files: Rather than hardcoding synthetic inputs, I think it would be better to deliver a reproducible generator program (i.e., one that gives the same results when run with the same random seed).
Say we fix the random seed and generate the input in R (as it is now). Should we worry that R might change its random number generator in the future?
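It's a legitimate worry: R 3.6.0 changed the default behavior of `sample()`, for instance. Two mitigations: pin the generator algorithm by name rather than relying on the library default (in R, via `RNGkind`), and ship a checksum of a canonical output so version skew is detected rather than silently absorbed. A sketch of the pattern using Python/NumPy:

```python
import hashlib
import json
import numpy as np

# Pin the bit generator by name (PCG64) rather than relying on the
# library default, so a future change of default cannot alter the data.
rng = np.random.Generator(np.random.PCG64(2023))
data = rng.standard_normal(10).tolist()

# Record a checksum of the canonical output alongside the generator.
checksum = hashlib.sha256(json.dumps(data).encode()).hexdigest()

# Anyone can regenerate and compare: any drift in the generator
# shows up immediately as a checksum mismatch.
rng2 = np.random.Generator(np.random.PCG64(2023))
data2 = rng2.standard_normal(10).tolist()
assert hashlib.sha256(json.dumps(data2).encode()).hexdigest() == checksum
```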
[@bob-carpenter] This is not a stan-dev/stan issue. Could you please move this to either (a) example-models or (b) stat_comp_benchmarks, or if it's preliminary, to (c) design-docs. Or to Discourse for discussion until there's a plan for implementation.
It would be better to transfer this to example-models, which requires permission there. Could you give me write permission, or could you help me transfer it?
[@bob-carpenter] By what measure? I think we want to look at code coverage of our special functions and for model types. Are there any ODE models in there? Anything with more than a tiny amount of data?
No, there are no ODE models and no large data sets. Nevertheless, it should still be better than running a few linear regressions? Ideally we would have better coverage of model types, and users could also contribute to the model list.
[@bob-carpenter] For instance, the centered parameterization draws from p(alpha, sigma) and the non-centered from p(alpha_std, sigma). These are different joint densities, related through the transform alpha_std = alpha / sigma. Are these different models?
I agree. It is indeed a combination of model, parametrization, and sampling scheme. Fortunately, we have enough diagnostics to tell whether Stan's sampling result is reliable for a given example model in a given parametrization, and that is all I intended to say on this point.
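As a concrete instance of such a diagnostic, here is a simplified sketch of split R-hat (real runs should of course use Stan's built-in diagnostics rather than this toy):

```python
import numpy as np

def split_rhat(chains):
    """Split R-hat sketch: split each chain in half and compare
    within-chain to between-chain variance. `chains` is an (m, n)
    array of m chains with n draws each for one quantity."""
    m, n = chains.shape
    half = n // 2
    split = chains[:, : 2 * half].reshape(2 * m, half)  # 2m half-chains
    chain_means = split.mean(axis=1)
    w = split.var(axis=1, ddof=1).mean()    # within-chain variance
    b = half * chain_means.var(ddof=1)      # between-chain variance
    var_hat = (half - 1) / half * w + b / half
    return float(np.sqrt(var_hat / w))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 2000))          # four well-mixed chains
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])  # two chains off target
assert split_rhat(mixed) < 1.01             # near 1: chains agree
assert split_rhat(stuck) > 1.1              # far from 1: flagged as unreliable
```

When diagnostics like this (plus ESS, divergences, etc.) pass, that parametrization's posterior can serve as the reference; when they fail, that model/parametrization pair shouldn't be used as a benchmark truth.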
[@bob-carpenter] But aren't predictions just different random variables with their own posteriors? I don't see how it's such a different use case. Unless you mean using something like 0/1 loss instead of our usual log loss.
I think it is the difference between estimation of parameters (given the model) and prediction of future outcomes. As an obvious caveat, a wrong inference in the wrong model might do better than a correct inference.
I don't understand the difference. A prediction is coded as a parameter in Stan (or a generated quantity if it's simple enough and can be factored). Either way it's a random variable we can sample in the posterior conditioned on observed data.
Right, predictions can be viewed as generated quantities of parameters, and parameters can also be interpreted as predictive quantities. But at the practical level, we can always treat Stan sampling as the true posterior distribution of the parameters, and therefore as a benchmark for any approximation method, whenever Stan sampling passes all diagnostics; there can only be one truth. For prediction, it is unnecessary to match the exact posterior predictive distribution from Stan sampling, as it is not necessarily optimal in the first place. We have seen examples in which an approximation method gives a problematic posterior distribution for the parameters but still makes OK or even better predictions.