transforms's People
Forkers
bob-carpentertransforms's Issues
correlation matrix transforms
Add analysis of correlation matrix transforms
- implement Stan's correlation matrix transform as a Stan program
- implement the alternative from @adamhaber used in TensorFlow and discussed here: tensorflow/probability#694
- implement test models
- LKJ of various parameterizations
- ???
Stiefel manifold transforms
Collected from e-mail discussion:
transforms:
- see what geotorch does
- Cayley method (https://arxiv.org/abs/1810.02881)
- Polar expansion (https://arxiv.org/abs/1906.07684)
- Givens transform (https://doi.org/10.1214/20-BA1202)
- Householder-based method (novel?)
test cases (TBD):
more stable stick breaking
This calculates things on the log-scale and uses a streaming version of log_sum_exp
to get the cumulative log_sum_exp
. The calculation proceeds by first noticing that
All that needs to be done is making a cumulative LSE. We can do this by keeping track of the running max and update whenever we get a new max for the next element.
I will add the PR next.
functions {
vector cumulative_logsumexp (vector x) {
real running_max = negative_infinity();
real r = 0;
int N = num_elements(x);
vector[N] y;
for (n in 1:N) {
if (x[n] <= running_max) {
r += exp(x[n] - running_max);
} else {
r *= exp(running_max - x[n]);
r += 1.;
running_max = x[n];
}
y[n] = log(r) + running_max;
}
return y;
}
vector break_that_stick_lp(vector stick_slices) {
int K = num_elements(stick_slices) + 1;
vector[K] log_pi;
real logabsjac = 0;
vector[K - 1] z = stick_slices - log(reverse(linspaced_vector(K - 1, 1, K - 1)));
log_pi[1:K - 1] = log_inv_logit(z);
logabsjac += sum(log_pi[1:K - 1]);
log_pi[K] = 0;
log_pi[2:K] += cumulative_sum(log1m_inv_logit(z));
logabsjac += log_pi[K];
logabsjac += sum(log1m_exp(cumulative_logsumexp(log_pi[1:K - 2])));
target += logabsjac;
return log_pi;
}
}
data {
int<lower=0> N;
vector<lower=0>[N] alpha;
}
parameters {
vector[N - 1] y;
}
transformed parameters {
simplex[N] x = exp(break_that_stick_lp(y));
}
model {
}
ess per leapfrog step as the x axis in the cdf plot
I needed some quick clarification here: you get the ess after sampling is done, and the ess plot @sethaxen made is from concatenating ESS values from multiple sampling processes. Leapfrog steps are per iteration? so for the plot should I be looking at mean leapfrog step w.r.t every ESS value? I'm unsure
cc: @bob-carpenter
Final Cleanup of code
- Cleanup code properly - e.g. remove the draft scripts
- check if bounded transforms or other ones work properly in this mechanism
- get plots from n=100 repeated versions of the transforms
ESS/nsteps vs ESS/s
In #8 we agreed on plotting KDE of ESS/nsteps, but I just remembered that this is still only a proxy for what we are really probably interested in, which is how much computational work is required to get the target ESS. e.g. take transforms f
and g
. If the gradient of f
is 10x more expensive than that of g
, but f
has double the ESS/nsteps, Then we would prefer f
if comparing ESS/nsteps but clearly prefer g
if looking at ESS/s. ESS/s captures both, so perhaps that's a more relevant metric to report.
cleanup of paper
- remove duplicate stuff (comment it)
- check for incorrectness
- check for notational inconsistency
- add plots and captions suitably
Include a test on log_simplex transforms?
I was doing a bit of testing and for high dimensional values of the simplex the sampler often fails. A separate issue is that the dirichlet
distribution with these high dimensional vectors.
For example, the probit product transform fails for dimensions > 1000 (maybe even > 800) but the log probit product doesn't.
data {
int<lower=0> N;
vector<lower=0>[N] alpha;
}
parameters {
vector[N - 1] y;
}
transformed parameters {
vector[N] log_x;
real log_det_jacobian = 0;
{
real yi;
real log_zi, log_zprod = 0, log_zprod_new;
for (i in 1:N - 1) {
yi = y[i];
log_zi = std_normal_lcdf(yi | );
log_det_jacobian += std_normal_lpdf(yi |);
log_zprod_new = log_zprod + log_zi;
log_det_jacobian += log_zprod;
log_x[i] = log_diff_exp(log_zprod, log_zprod_new);
log_zprod = log_zprod_new;
}
log_x[N] = log_zprod;
}
}
model {
target += log_det_jacobian;
// target += dirichlet_lpdf(x | alpha);
}
Do we want to include tests with the transforms written on the log-scale?
Appendix
Meenal mentioned yesterday that this paper may be used as a starting point for a list of transforms. You all mind if I start an appendix with stuff that's not in the main paper? We can always break that out into a separate doc later. I just don't want to pollute the main doc if the focus is going to be on simplices.
three more simplex transforms to evaluate
Suggestions from @mortonjt on three more transforms to consider with notes on who volunteered to look at them:
-
weighted ilr (Bob): https://elifesciences.org/articles/21887
-
pivot coordinates (Meenal): https://www.researchgate.net/publication/316057918_Weighted_Pivot_Coordinates_for_Compositional_Data_and_Their_Application_to_Geochemical_Mappin
g -
L2 projection (Seth): https://arxiv.org/abs/1602.02068
Running all the transforms models on the cluster x100 times
Update: I already assigned the dirichlet model for all simplex transforms, once that comes out fine we can start with more
TO-DO: do this with other exponential family dists
Softmax augmented lower bound 0 on y
I noticed that sampling and uniform output is perfect for N > 2 when I enforce a lower bound of 0 on y. The case of N = 2 has one lonely divergence on default Stan settings.
Cleanup code properly - e.g. remove the draft scripts
symmetric positive definite matrix transforms (aka covariance matrices)
- implement SPD transforms as Stan programs
- code test distribution programs
- Wishart
- inverse Wishart
- something spectral (distribution on eigenvalues)?
The set of SPD matrices is not compact, so we need to add proper distributions. The Wisharts are conjugate for covariance in a multivariate normal model, so they're essentially like testing multivariate normal posteriors.
Info about scripts
sample.py describes all the relevant function args used everywhere more or less
example.ipynb briefly summarises how to make all the plots or evaluate things in a standardized way. there are example plots.png files as well.
this is far far from perfect but for now it works and if you want to improve these please feel free to do so. just put in an example in the notebook so everybody can check how to use things.
TODO left #25 : in the morning i'll add the stan functions as bob suggested so switching target densities is easier. I wanna get this stuff out of the way before looking at newer things so I don't spend too much time on this when checking other transforms.
Draft of Theory of transforms
Draft of Theory of transforms and context of HMC (this will overlap with Introduction somewhat)
Bounded transforms
simpler transforms for lower and upper bounds work ok from a preliminary check, will put them in a folder with separate files for lower, upper and lower+upper soon.
Minor notational point
In the paper, we define the
Geometry of transforms
understanding the geometry of transforms better:
- behavior of tail
- convexity
- (needs more thought and discussion)
Tasks for this week (27th June - 3rd July)
-
Draft of Introduction
-
Draft of Transforms section (Describing the theory of the transforms - i've been looking up original mentions of them for references)
-
Plotting the CDF and rmse/leapfrog after sampling them maybe a 1000 times and getting the average plots
-
Bob suggested trying ess per leapfrog step as the x axis in the cdf plot
-
understanding the geometry better:
- [ ] behavior of tail
- [ ] convexity
- [ ] (needs more thought and discussion) -
Starting with simpler transforms like exp or something for lower upper constraints
Note: Depending on how the actual process goes we might not be able to do all of this but I'd want us to do like 4 out of 6 maybe?
softmax log-det-jac
In the paper and code is -N * log1p_exp(log_sum_exp(y)) + sum(y)
there's a missing 0.5 * log(N)
somewhere. It's a constant so doesn't affect sampling but we should have it in. When I AD through it comes up with that constant.
PR #46 has a more efficient version (I think, it's more efficient).
check if bounded transforms or other ones work properly with current sampling and plot scripts
Ordered simplex
See https://discourse.mc-stan.org/t/ordered-simplex-constraint-transform/24102/17?u=spinkney
I'll add the Stan code for this. Martin helped a lot and this paper made it clear for me https://arxiv.org/pdf/0708.0176.pdf
Draft of Simplex Section
- Simplex and it's properties + definition
- transforms on simplex and their jacobians
- evaluation results for simplex transforms (plots and verbose justification)
Testing Jacobian determinants
I've been locally testing my Jacobian derivations in Julia with AD, but ideally our Stan implementations would be tested for correctness as well by computing the Jacobian of the actual transforms using AD and comparing its log-abs-det with the log log-abs-det-jac terms returned by the implementations. Is this straightforward to do? This is outside of my Stan expertise.
add stan functions for target density
Unit vector transforms
bijective transforms using charts:
- spherical coordinates
augmented transforms:
- canonical embedding in real space (several alternative priors for radius)
test cases:
- uniform -> concentrated (von Mises-fisher)
- bimodal (axial)
- girdle
- non-isotropic (Fisher-Bingham with varying skew)
ordered vector abs
I'm finding that using the absolute value to increment is more stable than
is constant.
I have to put a prior on x or else it's improper. When I compare to the built-in I notice the built-in has 1 or 2 divergences with a normal prior (N(0, 10)) and this one doesn't.
data {
int<lower=0> N;
vector<lower=0>[N] alpha;
}
parameters {
vector[N] y;
}
transformed parameters {
ordered[N] x ;
x[1] = y[1];
for(i in 2:N) {
x[i] = x[i - 1] + abs(y[i]);
}
}
model {
target += target_density_lp(x, alpha);
}
does the stick breaking Jacobian need 0.5 * log(N)?
When I AD through the stick breaking transform in Julia I need to add 0.5 * log(N)
to get the same log-det-jacobian.
import ForwardDiff
using LinearAlgebra
using LogExpFunctions
function stick_break(x)
Nm1 = size(x)[1]
z = logistic.(x .- log.(LinRange(Nm1, 1, Nm1)))
cum_sum = 0
for n in 2:Nm1
cum_sum += z[n - 1]
z[n] *= (1 - cum_sum)
end
return [z ; 1 - (cum_sum + z[Nm1])]
end
x = [1., 2., -0.4]
J = ForwardDiff.jacobian(x -> stick_break(x), x)
logdet(J' * J) * 0.5 # -6.74405991829611
# what we have
N = size(x)[1] + 1
z = logistic.(x .- log.(LinRange(size(x)[1], 1, size(x)[1])))
our_jac_det = sum(log.(z) + log1p.(-z) + log1p.(-cumsum([0 ; s[1:N - 2]]))) # -7.437207098856055
# add in 0.5 * log(N)
out_jac_det + 0.5 * log(N) # -6.744059918296109
Hyperspherical parameterization of the simplex
As noted in an earlier e-mail, if
As mentioned in #1, using exp
to make each element positive in the end gives us the augmented softmax parameterization.
We can do this as well with the hyperspherical transform, which is written
where
The requirement that
It turns out that the Jacobian determinant here is
I suspect due to the trigonometric functions, it may not be extremely efficient. However, all of these trigonometric functions are bijective over this range, so we can replace sines or cosines with numbers constrained to the interval
I'll shortly add Stan functions for both of these.
init values for sampling
so for the hyperspherical.stan and hypersphericalangular.stan default initialization fails and something very close to 0 works.
now the question being, would it make sense to use a modified init that would work for all the transforms and report that in the standard results and the other modifications that we try in the separate discussion for results?
or does it make sense to check the init correlation with plots for every transform and imperfectly choose the best initialization for each transform and try to justify why.
get plots for data from cluster (sampling repeated 100 times or so for every transform)
parametrization of softmax-augmented
@sethaxen if i remember correctly you suggested using p=1/N for the augmented softmax, I can see that the RMSE plots for that version are near straight lines or weird curves in some parametrizations and alright in some, the error isn't high or something but yeah. they come out similar to the rest when i take p=0.5 or something.
in contrast with this for p=0.5. do you have any thoughts about which values of p we should we go for in the actual paper
Draft of Introduction
stuff to sample/plot
Transforms for complex manifolds
A few of the manifolds we have (the sphere, symmetric positive definite matrices, and Stiefel manifold) have straightforward complex analogs that appear in statistical applications, and the transforms for these manifolds can be generalized to map to these complex spaces with likely very little additional effort. I suggest we do so for completeness since Stan now has complex number support.
Since I've been working on these manifolds, I would take care of this generalization, but before putting in the effort, I want to confirm if others think this could be within the scope of the paper.
A few notes:
- the complex sphere
ℂⁿ
is equivalent to the real sphereℝ²ⁿ
, so we don't need to do anything special here. - symmetric positive definite matrices generalize to hermitian positive definite matrices. Everything is the same, except eigenvectors are now unitary instead of orthogonal, and the sub-diagonal of the lower Cholesky factor is complex. We could test with the complex Wishart distribution
- Stiefel: the semi-orthogonal matrices become semi-unitary. This allows for potentially better transforms. e.g. the Orthogonal group O(N) is a subgroup of unitary U(N), but it has 2 disconnected submanifolds (the ones with +1 and -1 determinant). Thus any transform from unconstrained space to O(N) must use parameter expansion to connect the 2 submanifolds; bijective maps like matrix exponential can only cover one of the submanifolds. However, unitary matrices have only the constraint that the absolute value of the determinant is 1 (i.e. it the determinant lives on the circle in the complex plane), which smoothly connects the submanifolds with +1 and -1 determinant, so parameter expansion is not needed, and bijective maps are possible. Distributions like Matrix von Mises-Fisher and Bingham have complex analogs we could test with.
sum to zero transform
@spinkney proposed a new sum-to-zero transform:
stan-dev/stanc3#1111 (comment)
Given time, we should evaluate this as well as the one we currently use (which sets the last value to the negation of the sum of the first values).
Inverse isometric log ratio simplex method
I can't find where @bob-carpenter mentioned that he was still trying to understand this transform but it piqued my curiosity because I hadn't heard of it.
After grokking the idea I was able to put together an R script that generates a simplex using it. The idea connects unit vector and simplex stuff together. The following is the reverse of the ILR so it's a bit weird. Needing to invert the Helmert matrix - which enforces the the sum-to-zero constraint in the forward mapping.
make_V <- function(D) {
apply(t(contr.helmert(D)), 1, function(x) x /norm(x))
}
clr_trans <- function(x) {
D <- length(x)
log(x) - sum(log(x)) / D
}
random_simplex <- function(D) {
x_test_ilr <- runif(D - 1, -5, 5)
# To reverse transform
# append a row with all 0s except the last as 1 to make V matrix invertible
V <- make_V(D)
V_fullrank <- rbind(t(V), c(rep(0, D - 1), 1))
V_inv_fullrank <- solve(V_fullrank)
# append 0 on the ILR values
x_test_ilr_zero <- c(x_test_ilr, 0)
exp_s <- exp(V_inv_fullrank %*% x_test_ilr_zero)
exp_s
alpha <- sum(exp_s[1:D-1])
Z <- -log1p(alpha)
# values match x_test
return(exp_s * exp(Z))
}
random_simplex(5)
[,1]
1 0.020302134
2 0.039406348
3 0.704896030
4 0.005134803
5 0.230260686
I've written a Stan program without the log-det-jac adjustment because I'm tired and I think @sethaxen may be able to whip it up much quicker than me. I put a normal prior just to get sampling going.
data {
int<lower=0> N;
matrix[N - 1 , N - 1] Vinv;
}
parameters {
vector[N - 1] y;
}
transformed parameters {
simplex[N] z;
{
vector[N - 1] s = Vinv * y;
real alpha = log_sum_exp(append_row(s, 0));
z = append_row(exp(s - alpha), 1 / exp(alpha));
}
}
model {
y ~ normal(0, 10);
}
generated quantities {
real z_sum = sum(z);
}
To run this you need
make_V <- function(D) {
apply(t(contr.helmert(D)), 1, function(x) x /norm(x))
}
V <- make_V(N)
V_fullrank <- rbind(t(V), c(rep(0, N - 1), 1))
V_inv_fullrank <- solve(V_fullrank)
mod_out <- mod$sample(
data = list(N = N,
Vinv = matrix(V_inv_fullrank[1:(N-1), 1:(N-1)], N-1, N-1)),
parallel_chains = 4,
iter_sampling = 2000
)
mod_out$summary("z")
and example output of
mod_out$summary("z")
# A tibble: 5 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 z[1] 0.201 0.0000368 0.371 0.0000546 6.36e-14 1.00 1.00 8449. 6023.
2 z[2] 0.206 0.0000382 0.374 0.0000567 3.85e-14 1.00 1.00 9083. 6334.
3 z[3] 0.199 0.0000351 0.369 0.0000521 4.16e-14 1.00 1.00 8053. 6443.
4 z[4] 0.194 0.0000241 0.365 0.0000357 4.58e-14 1.00 1.00 7887. 6253.
5 z[5] 0.201 0.0000500 0.371 0.0000742 3.62e-14 1.00 1.00 7808. 6525.
mathematically defining transforms
The paper's getting a bit out of hand with everyone writing in their own style. I'm trying to work with Meenal now to strip out most of what's there and then build up a draft with coherent notation. As such, the first thing we need to do is define transforms. To me, the goal of the whole business is that when we sample on the unconstrained scale and transform, we get a uniform distribution on the constrained scale. So informally, we want something like this.
Suppose
$\mathcal{Y}$ is a domain of unconstrained parameters and$\mathcal{X}$ is a domain of constrained parameters. For now,$\mathcal{Y} = \mathbb{R}^N$ for some$N$ and$\mathcal{X} \subseteq \mathbb{R}^M$ for some$M$ . We further suppose a (perhaps improper) density function$p_Y: \mathcal{Y} \rightarrow (0, \infty)$ with support over all elements of$\mathcal{Y}$ . We further suppose we have a surjective (onto) map$g:\mathcal{Y} \rightarrow \mathcal{X}$ . Informally, we want to ensure that if we have$Y \sim p_Y(\cdot)$ , then$g(Y) \sim \textrm{uniform}(\mathcal{X}).$
The reason this definition doesn't work is that
Definition (Constraining Transform): Suppose
$\mathcal{Y}$ is a domain of unconstrained parameters and$\mathcal{X}$ is a domain of constrained parameters. Further suppose a (perhaps improper) density function$p_Y: \mathcal{Y} \rightarrow (0, \infty)$ with support over all elements of$\mathcal{Y}$ and a surjective map$g:\mathcal{Y} \rightarrow \mathcal{X}$ . The pair$(p_Y, g)$ is a constraining transform from$\mathcal{Y}$ to$\mathcal{X}$ if for any proper density$\pi_X(x)$ defined over$\mathcal{X}$ , the density defined by$p_Y(y) = \pi_Y(y) \cdot \pi_X(g(y))$ is proper and if$Y \sim \pi_Y$ , then$X = g(Y) \sim \pi_X$ .
We can then go on and define the special case of one-one changes of variables.
Lemma (Change of Variables): Suppose
$\mathcal{Y} = \mathbb{R}^N$ ,$\mathcal{X} \subseteq \mathbb{R}^N$ ,$\pi_X:\mathcal{X} \rightarrow (0, \infty)$ is a proper density, and$f:\mathcal{X} \rightarrow \mathcal{Y}$ is a differentiable function. If we let$\pi_Y(y) = \left| \textrm{det} \frac{\partial}{\partial y} f^{-1}(y) \right|$ , then the pair$(\pi_Y, f^{-1})$ is a constraining transform. Proof: From the change of variables formula,$p_Y(y) = p_X(f^{-1}(y)) \left| \textrm{det} \frac{\partial}{\partial y} f^{-1}(y) \right|.$
We can then move on to the example of simplexes, where we are defining a transform from
Writing the paper using Overleaf
Are you guys interested in trying Overleaf for collaborative writing?
I've used the current .tex file to generate this project: https://www.overleaf.com/6115381575pygbmdbkncqn
My experience with Overleaf has been really good so far; I think it can make it easier to ask questions, make small edits etc without making too many PRs / unreviewed edits. What do think?
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.