
transforms's Issues

more stable stick breaking

This calculates things on the log-scale and uses a streaming version of log_sum_exp to get the cumulative log_sum_exp. The calculation proceeds by first noticing that

$$ \begin{align} &y = \log ( 1 - \sum \exp(x)) \\ &1 - \exp(y) = \sum \exp(x) \\ &\log(1 - \exp(y)) = \text{LSE}(x) \end{align} $$

All that remains is to make the LSE cumulative. We can do this by keeping track of the running max and updating it whenever the next element gives a new max.

I will add the PR next.

 functions {
  vector cumulative_logsumexp (vector x) {
      real running_max = negative_infinity();
      real r = 0;
      int N = num_elements(x);
      vector[N] y;
      
      for (n in 1:N) {
          if (x[n] <= running_max) {
              r += exp(x[n] - running_max);
          } else {
              r *= exp(running_max - x[n]);
              r += 1.;
              running_max = x[n];
          }
          y[n] = log(r) + running_max;
      }
      return y;
  }    

  vector break_that_stick_lp(vector stick_slices) {
    int K = num_elements(stick_slices) + 1;
    vector[K] log_pi;
    real logabsjac = 0;
    vector[K - 1] z = stick_slices - log(reverse(linspaced_vector(K - 1, 1, K - 1)));
    
    log_pi[1:K - 1] = log_inv_logit(z);

    logabsjac += sum(log_pi[1:K - 1]);

    log_pi[K] = 0;
    log_pi[2:K] += cumulative_sum(log1m_inv_logit(z));
    
    logabsjac += log_pi[K];
    logabsjac += sum(log1m_exp(cumulative_logsumexp(log_pi[1:K - 2])));
    
    target += logabsjac;

    return log_pi;
  }

}

data {
  int<lower=0> N;
  vector<lower=0>[N] alpha;
}
parameters {
  vector[N - 1] y;
}
transformed parameters {
  simplex[N] x = exp(break_that_stick_lp(y));
}
model {
}

ess per leapfrog step as the x axis in the cdf plot

I needed some quick clarification here: you get the ESS after sampling is done, and the ESS plot @sethaxen made concatenates ESS values from multiple sampling runs. Leapfrog steps are per iteration, so for the plot should I be looking at the mean number of leapfrog steps for every ESS value? I'm unsure.

cc: @bob-carpenter

Final Cleanup of code

  • Clean up the code properly, e.g. remove the draft scripts
  • check whether the bounded transforms and the other transforms work properly in this mechanism
  • get plots from n=100 repeated runs of the transforms

ESS/nsteps vs ESS/s

In #8 we agreed on plotting the KDE of ESS/nsteps, but I just remembered that this is still only a proxy for what we are probably really interested in, which is how much computational work is required to get the target ESS. E.g. take transforms f and g: if the gradient of f is 10x more expensive than that of g, but f has double the ESS/nsteps, then we would prefer f when comparing ESS/nsteps but clearly prefer g when looking at ESS/s. ESS/s captures both, so perhaps that's the more relevant metric to report.
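To make the example above concrete, write $c$ for the cost of one leapfrog step, so ESS/s is proportional to (ESS/nsteps)$/c$. Then

$$ \frac{(\text{ESS/s})_f}{(\text{ESS/s})_g} = \frac{(\text{ESS/nsteps})_f}{(\text{ESS/nsteps})_g} \cdot \frac{c_g}{c_f} = 2 \cdot \frac{1}{10} = 0.2, $$

i.e. g is 5x better by ESS/s even though f is 2x better by ESS/nsteps.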

cleanup of paper

  • remove duplicate stuff (comment it)
  • check for incorrectness
  • check for notational inconsistency
  • add plots and captions suitably

Include a test on log_simplex transforms?

I was doing a bit of testing, and for high-dimensional simplexes the sampler often fails. A separate issue is that the Dirichlet distribution itself becomes problematic with these high-dimensional vectors.

For example, the probit product transform fails for dimensions > 1000 (maybe even > 800) but the log probit product doesn't.

data {
 int<lower=0> N;
 vector<lower=0>[N] alpha;
}
parameters {
  vector[N - 1] y;
}
transformed parameters {
  vector[N] log_x;
  real log_det_jacobian = 0;
  {
    real yi;
    real log_zi;
    real log_zprod = 0;
    real log_zprod_new;

    for (i in 1:N - 1) {
      yi = y[i];
      log_zi = std_normal_lcdf(yi | );
      log_det_jacobian += std_normal_lpdf(yi |);
      log_zprod_new = log_zprod + log_zi;
      log_det_jacobian += log_zprod;
      log_x[i] = log_diff_exp(log_zprod, log_zprod_new);
      log_zprod = log_zprod_new;
    }
    log_x[N] = log_zprod;
  }
}
model {
  target += log_det_jacobian;
 // target += dirichlet_lpdf(x | alpha);
}

Do we want to include tests with the transforms written on the log-scale?

Appendix

Meenal mentioned yesterday that this paper may be used as a starting point for a list of transforms. You all mind if I start an appendix with stuff that's not in the main paper? We can always break that out into a separate doc later. I just don't want to pollute the main doc if the focus is going to be on simplices.

@mjhajharia @bob-carpenter @sethaxen

Softmax augmented lower bound 0 on y

I noticed that sampling is clean and the output is uniform for N > 2 when I enforce a lower bound of 0 on y. The N = 2 case has one lonely divergence under default Stan settings.

symmetric positive definite matrix transforms (aka covariance matrices)

  • implement SPD transforms as Stan programs
  • code test distribution programs
    • Wishart
    • inverse Wishart
    • something spectral (distribution on eigenvalues)?

The set of SPD matrices is not compact, so we need to add proper distributions. The Wisharts are conjugate for covariance in a multivariate normal model, so they're essentially like testing multivariate normal posteriors.
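As a starting point for the test distribution programs, a minimal Wishart check using Stan's built-in cov_matrix type as the reference could look like the following sketch (just a sketch; the data names are placeholders, and the custom SPD transforms would replace the built-in declaration and add their own log-det-Jacobian terms):

data {
  int<lower=1> K;
  real<lower=K - 1> nu;  // Wishart degrees of freedom
  cov_matrix[K] S;       // Wishart scale matrix
}
parameters {
  cov_matrix[K] Sigma;   // built-in SPD transform as the baseline
}
model {
  Sigma ~ wishart(nu, S);
}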

Info about scripts

sample.py describes all the relevant function args used everywhere, more or less.

example.ipynb briefly summarises how to make all the plots and evaluate things in a standardized way; there are example plots.png files as well.

This is far from perfect, but for now it works, and if you want to improve these please feel free to do so. Just put an example in the notebook so everybody can check how to use things.

TODO left (#25): in the morning I'll add the Stan functions as Bob suggested so switching target densities is easier. I want to get this out of the way before looking at newer things so I don't spend too much time on it when checking other transforms.

CC: @bob-carpenter @adamhaber

Bounded transforms

Simpler transforms for lower and upper bounds work OK from a preliminary check; I will put them in a folder with separate files for lower, upper, and lower+upper soon.
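A minimal sketch of the three simple bound transforms in one Stan program (my own placeholder layout and test densities, not the final files; the transforms and Jacobian terms are the standard exp / scaled-logit ones):

data {
  real a;           // lower bound
  real<lower=a> b;  // upper bound
}
parameters {
  real y_lower;
  real y_upper;
  real y_both;
}
transformed parameters {
  real x_lower = a + exp(y_lower);                // lower bound only
  real x_upper = b - exp(y_upper);                // upper bound only
  real x_both = a + (b - a) * inv_logit(y_both);  // lower and upper bound
}
model {
  // log-abs-det-Jacobian adjustments
  target += y_lower;
  target += y_upper;
  target += log(b - a) + log_inv_logit(y_both) + log1m_inv_logit(y_both);
  // placeholder proper test densities so the draws can be checked
  target += exponential_lpdf(x_lower - a | 1);
  target += exponential_lpdf(b - x_upper | 1);
  // x_both is uniform(a, b) given only the Jacobian adjustment above
}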

Minor notational point

In the paper, we define the $N$-simplex as being an $N+1$-length vector. In our model implementations, we use $N$ to refer to the number of elements in the vector on the simplex. We should harmonize these notations.

Geometry of transforms

understanding the geometry of transforms better:

  • behavior of tail
  • convexity
  • (needs more thought and discussion)

Tasks for this week (27th June - 3rd July)

  • Draft of Introduction

  • Draft of Transforms section (describing the theory of the transforms; I've been looking up original mentions of them for references)

  • Plotting the CDF and RMSE/leapfrog after sampling them maybe 1000 times and getting the average plots

  • Bob suggested trying ess per leapfrog step as the x axis in the cdf plot

  • understanding the geometry better:
    - [ ] behavior of tail
    - [ ] convexity
    - [ ] (needs more thought and discussion)

  • Starting with simpler transforms like exp or something for lower upper constraints

Note: depending on how things actually go, we might not be able to do all of this, but I'd want us to do maybe 4 out of 6.

softmax log-det-jac

In the paper and code the log-det-Jacobian is -N * log1p_exp(log_sum_exp(y)) + sum(y), but there's a missing 0.5 * log(N) somewhere. It's a constant so it doesn't affect sampling, but we should have it in. When I AD through the transform, that constant comes up.

PR #46 has a version that I think is more efficient.
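For reference, the expression in question with the proposed constant added (this is just the claim above in display form, not a derivation):

$$ \log |J| = \sum_{i} y_i - N \log\left(1 + \sum_{i} \exp(y_i)\right) + \tfrac{1}{2} \log N. $$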

Draft of Simplex Section

  • Simplex and its properties + definition
  • transforms on simplex and their jacobians
  • evaluation results for simplex transforms (plots and verbose justification)

Testing Jacobian determinants

I've been locally testing my Jacobian derivations in Julia with AD, but ideally our Stan implementations would be tested for correctness as well, by computing the Jacobian of the actual transforms using AD and comparing its log-abs-det with the log-abs-det-jac terms returned by the implementations. Is this straightforward to do? This is outside my Stan expertise.
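One way to do this, mirroring the stick-breaking AD check in another issue here: compute the full Jacobian $J$ of the map from the unconstrained vector to the constrained vector with AD, then compare $\tfrac{1}{2} \log \det(J^\top J)$ (which reduces to $\log |\det J|$ when the map is square) against the log-abs-det-jac term returned by the Stan program.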

Unit vector transforms

bijective transforms using charts:

  • spherical coordinates

augmented transforms:

  • canonical embedding in real space (several alternative priors for radius)

test cases:

  • uniform -> concentrated (von Mises-Fisher)
  • bimodal (axial)
  • girdle
  • non-isotropic (Fisher-Bingham with varying skew)

ordered vector abs

I'm finding that using the absolute value to increment is more stable than $\exp(\cdot)$. The Jacobian adjustment is constant: each diagonal entry of the Jacobian is $\pm 1$, so the log absolute determinant is 0.

I have to put a prior on x or else it's improper. When I compare to the built-in, I notice the built-in has 1 or 2 divergences with a normal prior (N(0, 10)) and this one doesn't.

data {
 int<lower=0> N;
 vector<lower=0>[N] alpha;
}
parameters {
 vector[N] y;
}
transformed parameters {
 ordered[N] x;
 x[1] = y[1];
 for (i in 2:N) {
   x[i] = x[i - 1] + abs(y[i]);
 }
}
model {
 target += target_density_lp(x, alpha);
}

does the stick breaking Jacobian need 0.5 * log(N)?

When I AD through the stick breaking transform in Julia I need to add 0.5 * log(N) to get the same log-det-jacobian.

import ForwardDiff
using LinearAlgebra
using LogExpFunctions

function stick_break(x)
    Nm1 = size(x)[1]
    z = logistic.(x .- log.(LinRange(Nm1, 1, Nm1))) 
    cum_sum = 0

    for n in 2:Nm1 
        cum_sum += z[n - 1]
        z[n] *= (1 - cum_sum) 
    end

    return [z ; 1 - (cum_sum + z[Nm1])]
end

x = [1., 2., -0.4]
J = ForwardDiff.jacobian(x -> stick_break(x), x)
logdet(J' * J) * 0.5 # -6.74405991829611

# what we have
N = size(x)[1] + 1
z = logistic.(x .- log.(LinRange(size(x)[1], 1, size(x)[1])))
s = stick_break(x)  # simplex values, needed for the cumulative-sum term below
our_jac_det = sum(log.(z) + log1p.(-z) + log1p.(-cumsum([0 ; s[1:N - 2]]))) # -7.437207098856055

# add in 0.5 * log(N)
our_jac_det + 0.5 * log(N) # -6.744059918296109

Hyperspherical parameterization of the simplex

As noted in an earlier e-mail, if $z$ is a unit vector of length $N$, then squaring each element so $x_i = z_i^2$ gives us a point $x$ on the simplex. If all elements of $z$ are positive (i.e. $z$ is on the positive orthant of the hypersphere), then this map is bijective. So if we can generate unit vectors with positive elements, we can use that transform to also sample on the simplex.

As mentioned in #1, using exp to make each element positive in the end gives us the augmented softmax parameterization.

We can do this as well with the hyperspherical transform, which is written

$$ z_i = \left(\prod_{j=1}^{i-1} \sin(\phi_j)\right) \begin{cases} 1 & \text{ if } i = N\\ \cos(\phi_i) & \text{otherwise} \end{cases}, $$

where $\phi_j$ is in $(0, 2\pi]$ when $j=N-1$ and $(0, \pi]$ otherwise.

The requirement that $z_i$ be positive for $i < N-1$ is satisfied by requiring $\phi_i \in (0, \pi/2]$ for all $i$.

It turns out that the Jacobian determinant here is

$$ \begin{aligned} |J| &= 2^{N-1} \left(\prod_{i=1}^{N-2} \sin(\phi_i)^{N-i-1}\right) \left(\prod_{i=1}^N z_i \right)\\ &= 2^{N-1} \prod_{i=1}^{N-1}\tan(\phi_i) x_i\\ &= 2^{N-1} \prod_{i=1}^{N-1}\sin^{2(N-i)-1}(\phi_i) \cos(\phi_i). \end{aligned} $$

I suspect due to the trigonometric functions, it may not be extremely efficient. However, all of these trigonometric functions are bijective over this range, so we can replace sines or cosines with numbers constrained to the interval $(0, 1)$ to get a new transform.

I'll shortly add Stan functions for both of these.
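In the meantime, here is a rough Stan sketch of the angular version (my own untested draft; the function name and the Dirichlet test density are placeholders, and the log-det-Jacobian is the expression above):

functions {
  // phi in (0, pi/2)^{N-1} -> simplex x, with x_i = z_i^2 and z the
  // positive-orthant hyperspherical coordinates; increments target
  // with the log-abs-det-Jacobian given in the issue text.
  vector hyperspherical_simplex_lp(vector phi) {
    int N = num_elements(phi) + 1;
    vector[N] x;
    real log_sin_prod = 0;  // log prod_{j < i} sin(phi_j)
    for (i in 1:N - 1) {
      x[i] = exp(2 * (log_sin_prod + log(cos(phi[i]))));
      target += (2 * (N - i) - 1) * log(sin(phi[i])) + log(cos(phi[i]));
      log_sin_prod += log(sin(phi[i]));
    }
    x[N] = exp(2 * log_sin_prod);
    target += (N - 1) * log(2);
    return x;
  }
}
data {
  int<lower=1> N;
  vector<lower=0>[N] alpha;
}
parameters {
  vector<lower=0, upper=pi() / 2>[N - 1] phi;
}
transformed parameters {
  simplex[N] x = hyperspherical_simplex_lp(phi);
}
model {
  target += dirichlet_lpdf(x | alpha);
}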

init values for sampling

For hyperspherical.stan and hypersphericalangular.stan the default initialization fails, and something very close to 0 works.

The question now is: would it make sense to use a modified init that works for all the transforms and report that in the standard results, with the other modifications we try covered in the separate discussion of results?

Or does it make sense to check the sensitivity to inits with plots for every transform, imperfectly choose the best initialization for each transform, and try to justify why?

cc: @sethaxen @bob-carpenter @spinkney

parametrization of softmax-augmented

@sethaxen if I remember correctly you suggested using p = 1/N for the augmented softmax. I can see that the RMSE plots for that version are nearly straight lines or odd curves in some parametrizations and fine in others; the error isn't high or anything, but still. They come out similar to the rest when I take p = 0.5 or so.

[RMSE plots for p = 1/N]

In contrast, here it is for p = 0.5. Do you have any thoughts about which value of p we should go for in the actual paper?

[RMSE plot for p = 0.5]

stuff to sample/plot

list of transforms to sample:

  • augmented-ilr
  • hyperspherical_angular
  • logistic
  • probit

plots currently:

  • cdf and density for ess/leapfrog
  • rmse vs cum. leapfrog steps

Any other transforms or plots we want to add? Do we want something with ESS bulk/tail separately, or anything else?

cc: @sethaxen @spinkney

Transforms for complex manifolds

A few of the manifolds we have (the sphere, symmetric positive definite matrices, and Stiefel manifold) have straightforward complex analogs that appear in statistical applications, and the transforms for these manifolds can be generalized to map to these complex spaces with likely very little additional effort. I suggest we do so for completeness since Stan now has complex number support.

Since I've been working on these manifolds, I would take care of this generalization, but before putting in the effort, I want to confirm if others think this could be within the scope of the paper.

A few notes:

  • the unit sphere in ℂⁿ is equivalent to the unit sphere in ℝ²ⁿ, so we don't need to do anything special here.
  • symmetric positive definite matrices generalize to Hermitian positive definite matrices. Everything is the same, except eigenvectors are now unitary instead of orthogonal, and the sub-diagonal of the lower Cholesky factor is complex. We could test with the complex Wishart distribution.
  • Stiefel: the semi-orthogonal matrices become semi-unitary. This allows for potentially better transforms. E.g. the orthogonal group O(N) is a subgroup of the unitary group U(N), but it has 2 disconnected submanifolds (the ones with +1 and -1 determinant). Thus any transform from unconstrained space to O(N) must use parameter expansion to connect the 2 submanifolds; bijective maps like the matrix exponential can only cover one of the submanifolds. However, unitary matrices have only the constraint that the absolute value of the determinant is 1 (i.e. the determinant lives on the unit circle in the complex plane), which smoothly connects the submanifolds with +1 and -1 determinant, so parameter expansion is not needed and bijective maps are possible. Distributions like the matrix von Mises-Fisher and Bingham have complex analogs we could test with.

Inverse isometric log ratio simplex method

I can't find where @bob-carpenter mentioned that he was still trying to understand this transform, but it piqued my curiosity because I hadn't heard of it.

After grokking the idea I was able to put together an R script that generates a simplex using it. The idea connects the unit vector and simplex material together. The following is the reverse of the ILR, so it's a bit backwards: we need to invert the Helmert matrix, which enforces the sum-to-zero constraint in the forward mapping.

make_V <- function(D) {
  apply(t(contr.helmert(D)), 1, function(x) x /norm(x))
}

clr_trans <- function(x) {
  D <- length(x)
  log(x) - sum(log(x)) / D
}

random_simplex <- function(D) {

  x_test_ilr <- runif(D - 1, -5, 5)
  # To reverse transform 
  # append a row with all 0s except the last as 1 to make V matrix invertible
  V <- make_V(D)
  V_fullrank <- rbind(t(V), c(rep(0, D - 1), 1))
  V_inv_fullrank <- solve(V_fullrank)
  
  # append 0 on the ILR values
  x_test_ilr_zero <- c(x_test_ilr, 0)
  
  exp_s <- exp(V_inv_fullrank %*% x_test_ilr_zero)
  
  alpha <- sum(exp_s[1:(D - 1)])
  Z <- -log1p(alpha)
  
  # values match x_test
  return(exp_s * exp(Z))
  
}
random_simplex(5)
         [,1]
1 0.020302134
2 0.039406348
3 0.704896030
4 0.005134803
5 0.230260686

I've written a Stan program without the log-det-jac adjustment because I'm tired and I think @sethaxen may be able to whip it up much quicker than me. I put a normal prior just to get sampling going.

data {
 int<lower=0> N;
 matrix[N - 1 , N - 1] Vinv;
}
parameters {
 vector[N - 1] y;
}
transformed parameters {
  simplex[N] z;
  {
    vector[N - 1] s = Vinv * y;
    real alpha = log_sum_exp(append_row(s, 0));
    z = append_row(exp(s - alpha), 1 / exp(alpha));
  }
 
}
model {
 y ~ normal(0, 10);
}
generated quantities {
  real z_sum = sum(z);
}

To run this you need

make_V <- function(D) {
  apply(t(contr.helmert(D)), 1, function(x) x /norm(x))
}

N <- 5  # matches the example output below
V <- make_V(N)
V_fullrank <- rbind(t(V), c(rep(0, N - 1), 1))
V_inv_fullrank <- solve(V_fullrank)

# mod below is the cmdstanr model object compiled from the Stan program above

mod_out <- mod$sample(
  data = list(N = N,
              Vinv = matrix(V_inv_fullrank[1:(N-1), 1:(N-1)], N-1, N-1)),
  parallel_chains = 4,
  iter_sampling = 2000
)
mod_out$summary("z")

and example output of

 mod_out$summary("z")
# A tibble: 5 × 10
  variable  mean    median    sd       mad       q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>     <dbl> <dbl>     <dbl>    <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 z[1]     0.201 0.0000368 0.371 0.0000546 6.36e-14  1.00  1.00    8449.    6023.
2 z[2]     0.206 0.0000382 0.374 0.0000567 3.85e-14  1.00  1.00    9083.    6334.
3 z[3]     0.199 0.0000351 0.369 0.0000521 4.16e-14  1.00  1.00    8053.    6443.
4 z[4]     0.194 0.0000241 0.365 0.0000357 4.58e-14  1.00  1.00    7887.    6253.
5 z[5]     0.201 0.0000500 0.371 0.0000742 3.62e-14  1.00  1.00    7808.    6525.

mathematically defining transforms

The paper's getting a bit out of hand with everyone writing in their own style. I'm trying to work with Meenal now to strip out most of what's there and then build up a draft with coherent notation. As such, the first thing we need to do is define transforms. To me, the goal of the whole business is that when we sample on the unconstrained scale and transform, we get a uniform distribution on the constrained scale. So informally, we want something like this.

Suppose $\mathcal{Y}$ is a domain of unconstrained parameters and $\mathcal{X}$ is a domain of constrained parameters. For now, $\mathcal{Y} = \mathbb{R}^N$ for some $N$ and $\mathcal{X} \subseteq \mathbb{R}^M$ for some $M$. We further suppose a (perhaps improper) density function $p_Y: \mathcal{Y} \rightarrow (0, \infty)$ with support over all elements of $\mathcal{Y}$. We further suppose we have a surjective (onto) map $g:\mathcal{Y} \rightarrow \mathcal{X}$. Informally, we want to ensure that if we have $Y \sim p_Y(\cdot)$, then $g(Y) \sim \textrm{uniform}(\mathcal{X}).$

The reason this definition doesn't work is that $p_Y$ may be improper, so it doesn't really make sense to talk about $Y \sim p_Y(\cdot)$. So instead, I think we want this:

Definition (Constraining Transform): Suppose $\mathcal{Y}$ is a domain of unconstrained parameters and $\mathcal{X}$ is a domain of constrained parameters. Further suppose a (perhaps improper) density function $\pi_Y: \mathcal{Y} \rightarrow (0, \infty)$ with support over all elements of $\mathcal{Y}$ and a surjective map $g:\mathcal{Y} \rightarrow \mathcal{X}$. The pair $(\pi_Y, g)$ is a constraining transform from $\mathcal{Y}$ to $\mathcal{X}$ if for any proper density $\pi_X$ defined over $\mathcal{X}$, the density defined by $p_Y(y) = \pi_Y(y) \cdot \pi_X(g(y))$ is proper and if $Y \sim p_Y$, then $X = g(Y) \sim \pi_X$.

We can then go on and define the special case of one-one changes of variables.

Lemma (Change of Variables): Suppose $\mathcal{Y} = \mathbb{R}^N$, $\mathcal{X} \subseteq \mathbb{R}^N$, $\pi_X:\mathcal{X} \rightarrow (0, \infty)$ is a proper density, and $f:\mathcal{X} \rightarrow \mathcal{Y}$ is a differentiable bijection. If we let $\pi_Y(y) = \left| \textrm{det} \frac{\partial}{\partial y} f^{-1}(y) \right|$, then the pair $(\pi_Y, f^{-1})$ is a constraining transform. Proof: From the change of variables formula, $p_Y(y) = \pi_Y(y) \cdot \pi_X(f^{-1}(y)) = \pi_X(f^{-1}(y)) \left| \textrm{det} \frac{\partial}{\partial y} f^{-1}(y) \right|$ is the density of $Y = f(X)$ when $X \sim \pi_X$, so it is proper and $f^{-1}(Y) \sim \pi_X$.
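As a concrete one-dimensional instance of the lemma (my own illustration, not from the thread): take $\mathcal{X} = (0, \infty)$ and $f(x) = \log x$, so $f^{-1}(y) = e^y$ and

$$ \pi_Y(y) = \left| \frac{d}{dy} e^y \right| = e^y, \qquad p_Y(y) = e^y \, \pi_X(e^y). $$

For any proper $\pi_X$ on $(0, \infty)$, $p_Y$ is proper and $e^Y \sim \pi_X$; this is essentially Stan's built-in lower-bound transform together with its Jacobian adjustment.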

We can then move on to the example of simplexes, where we define a transform from $\mathbb{R}^N$ to $\Delta^N \subseteq \mathbb{R}^{N+1}$ by setting element $N + 1$ to $1$ minus the sum of the first $N$ elements. Another example is going from $\binom{M}{2} + M$ unconstrained dimensions to $M \times M$ matrices with the Cholesky-factor-of-covariance transform, multiplying the factor by its own transpose.

Writing the paper using Overleaf

Are you guys interested in trying Overleaf for collaborative writing?

I've used the current .tex file to generate this project: https://www.overleaf.com/6115381575pygbmdbkncqn

My experience with Overleaf has been really good so far; I think it can make it easier to ask questions, make small edits, etc. without making too many PRs / unreviewed edits. What do you think?
