vincenzocoia / distplyr Goto Github PK

View Code? Open in Web Editor NEW

2.0 2.0 2.0 2.58 MB

Draw powerful insights using distributions with this R package.

Home Page: https://distplyr.netlify.app/

License: Other

R 100.00%

distplyr's People

Contributors

Stargazers

Watchers

Forkers

yelselmiao zhaoshengee

distplyr's Issues

Parameters

One issue that distplyr hasn't considered yet is that of alternative parameterizations of a distribution.

For example, dst_norm() accepts mean and variance as the parameters, but other parameterizations would be useful, such as specifying, say, the 0.05- and 0.95-quantiles (use case: "I'm pretty sure this expense will be between $100 and $200", just set these as lower and upper quantiles). Even mean and stdev would be useful.

I have two ideas for achieving this, but they shouldn't be implemented until after the package is submitted to CRAN.

A parametric distribution is still only specified using its "canonical" parameters (mean and variance for Normal; alpha and beta for Beta; etc.), but changed downstream.

Something like:

dst_norm() %>%
    set_quantiles(0.05 ~ 100, 0.95 ~ 200)

This would operate rlang's tidy evaluation machinery, leaving unevaluated parameters as unknown variables (quosures, probably), until they are evaluated downstream. Bonus: this would also allow for scenarios like dst_unif(a, a + 1), where a is unknown.

Specify parameters by name in the dst_ call.

I'm thinking along the lines of ggplot2's aes() function. The aes() function "parameterizes" a plot according to some aesthetics (see ?geom_linerange for an example of being able to parameterize in more than one way). Except having a special function like aes() might not be useful here, because the parameters can be specified right in the dst_*() call.

So, instead of:

dst_norm <- function(mean, variance) { ... }

we would have something like:

dst_norm <- function(mean, variance, ...)

akin to aes <- function(x, y, ...).

Calls would look something like:

dst_norm(0, 1)  # For mean and variance
dst_norm(0, stdev = 5) # For mean and standard deviation
dst_norm(0.05 ~ 100, 0.95 ~ 200) # For the 0.05- and 0.95-quantiles.

Higher moments of finite distribution throw an error

It looks like the catchall .dst method is being used for finite/empirical distributions for higher moments, and sometimes causes an error:

library(distplyr)
foo <- dst_empirical(c(1, 5, 100))
skewness(foo)
kurtosis_exc(foo)

There should be a special .finite method for skewness and kurtosis.

@toast682 I'm documenting this here as something that's on our collective plate before v0.1.3 is launched. We'll tackle this together sometime soon -- it should only amount to coming up with a couple formulas.

Quantile of a graft distribution

The quantile function of a graft distribution still needs to be coded. Currently, if we try to evaluate quantiles of a graft, it will dispatch the default eval_quantile.dst() method, and will fail:

library(distplyr)
set.seed(1)
a <- dst_norm(0, 1)
x <- realise(a, n = 50)
b <- dst_empirical(x)
g <- graft_right(b, a, breakpoint = 1)
eval_quantile(g, at = 1:9/10)
#> Error in discontinuities(object): could not find function "discontinuities"

^{Created on 2021-09-14 by the reprex package (v0.3.0)}

It fails because the default eval_quantile.dst() isn't quite set up yet, but even if it was set up, we wouldn't want it to be dispatched, because eval_quantile.dst() needs to run an algorithm to invert the cdf to find the quantiles -- not efficient, especially when there's an easy way to calculate quantiles of a graft distribution.

How to calculate the quantile of a graft distribution

If grafting cdf A(x) to the right of cdf B(x) at point x0 (think "B" for "base" distribution), the cdf of the resulting distribution is:

If x < x0, it's simply B(x).
If x > x0, it's A(x) scaled so that it connects to B(x) at x0. So: A(x0) + (B(x) - B(x0)) * (1 - A(x0)) / (1 - B(x0))

This is the formula if x0 is kept in the base distribution B, but not the tail distribution A. If x0 is included in A instead of B, then A(x0) would instead be A(x0^{-}) -- that is, the limit at x0 coming from the left -- and B(x0) would instead be B(x0^{-}).

To find the quantile function, just invert this.

The cdf when grafting on the opposite tail is similar: grafting cdf A(x) to the left of B(x) at point x0, the cdf of the resulting distribution is:

If x < x0, it's A(x) scaled to line up to B(x) at x0. So: A(x) * B(x0) / A(x0)
If x > x0, it's simply B(x).

This is the formula if x0 is kept in the tail distribution A, but not the base distribution B. For the opposite, then as before, A(x0) becomes A(x0^{-}), and B(x0) becomes B(x0^{-}).

Discontinuities

@toast682 discontinuities() should accept a distribution, and return a data frame or tibble of all the points along the x-axis where the cdf has a jump continuity in one column, and the size of the discontinuity in the other.

Suite of parametric distributions

I've deliberately been delaying creating a collection of parametric distributions to be made available in distplyr, so that package evolution/design could be more nimble, and in case we came up with an idea for automating such a thing. This issue elaborates on the latter: somehow automating their creation. We've discussed two ideas so far:

Idea 1: brute force creation of files

The idea is to make one or more R scripts per distribution that's made available through R (like geom, gamma, beta, binom, etc.), containing all the functions needed for each distribution.

Problem: requires brute force and lots more lines of code to test; does not address the issue of allowing a user to add their own parametric distribution.

Ideas for execution:

Make template R files that can be copied for each distribution, with something like FILL_THIS_IN everywhere that needs a manual substitution.
Make a function (not unlike those found in the usethis package) that creates said R files, but with everything filled in.
Perhaps because this option would explode the number of files and functions in distplyr, perhaps store these distributions in a separate package, called something like "distionary".

I'm not digging this option.

Idea 2: make a function that gives access to a distribution

The idea is that, if a user has r, p, q, d functions available for a distribution (say "unif"), then they would be able to access it as a distplyr distribution by doing something like:

d <- dst_parametric("unif", min = 0, max = 1)

It (or related functions) would then do the work of appending the appropriate prefix whenever the user calls representations like eval_cdf(): perhaps eval_cdf.parametric() is dispatched, which grabs the "unif", appends p as a prefix, plugs in the parameters, and evaluates at the at argument.

If we do have this functionality, we should use it as developers, too. It would cut down our code dramatically. For example, we could code up dst_unif like so:

dst_unif <- function(min, max) dst_parametric("unif", min = min, max = max)

There's one problem remaining: there are simple formulas for quantities like the mean, variance, range, etc. We also need to specify if the distribution is continuous or discrete somehow. I'm thinking we can have functions like set_*() to specify these. So, maybe something like this:

User-facing creation of a Uniform distribution (in the hypothetical situation that distplyr::dst_unif does not exist):

d <- dst_parametric("unif", min = 0, max = 1) %>%
    set_mean((min + max) / 2) %>%
    set_range(min, max)
...etc

Developer-facing creation of a Uniform distribution for use as distplyr::dst_unif():

dst_unif <- function(min, max) {
    dst_parametric("unif", min, max) %>%
        set_mean((min + max) / 2) %>%
        set_range(min, max)
}

One big problem with this is that these additional pieces of information would have to be stored with the distribution itself, and we're trying to keep the things stored by a distribution at a minimum. Perhaps storing these things is OK on the user-facing side, but it's not when it comes to the developer-facing side of things.

Where would this information be stored for developers including a distribution like "unif"? Currently, we're storing this info in a verbose way: as specific methods (like mean.unif()), which have the overhead of needing to wrap the expression (like (min + max) / 2) with a bunch of function verbiage. An alternative? Maybe a JSON file, or simply a text file, containing information about these quantities, which could then be read by the specific method: mean.parametric() could look for the appropriate JSON/text file (for, say, "unif") and grab the relevant piece. If we go this route, we wouldn't use the set_*() functions as developers. This "data file" might look something like:

mean: (min + max) / 2
range: min, max
...

This approach is sounding far more robust and trustworthy than Idea 1. It would be great if we could get something to work here.

wishlist

It would be nice to see these functions added to distplyr:

mix(): construct mixture distribution
inflate_zero(): make a zero-inflated distribution; perhaps also the more generic inflate_at()?
condition(): obtain a conditional distribution.
left_connect(), as per right_connect().

Plot function failure for flipped distribution

The plot() function fails for flipped distribution:

library(distplyr)
r <- dst_gpd(1, 1, 1)
l <- -r
plot(r)

try(plot(l))
#> Error : C stack usage  7971168 is too close to the limit
eval_cdf(l, at = -10:10)
#>  [1] 0.1000000 0.1111111 0.1250000 0.1428571 0.1666667 0.2000000 0.2500000
#>  [8] 0.3333333 0.5000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
#> [15] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

^{Created on 2021-08-30 by the reprex package (v0.3.0)}

Slicing on flat part: 0-quantile incorrect

Slicing a distribution on a flat part, so that range() calculation is delegated to eval_quantile(), results in the 0-quantile being incorrect, as below. This is because it's inverting the original CDF to the left of the flat part, no longer to the right (which we do only for the 0-quantile)

library(distplyr)
d <- mix(dst_unif(0, 1), dst_empirical(c(-1, 0.5, 2)))
plot(d, "cdf", from = -2, to = 3, n = 1000)

slice_left(d, 1.5) %>% 
    eval_quantile(0)
#> [1] 1

^{Created on 2023-02-05 with reprex v2.0.2}

v0.1.3 remaining

@toast682 I refined the to-do list we were looking at today. Here's what's relevant:

Poisson distribution:

@toast682 Use pois instead of poisson to match with base R.
@toast682 Tests for the Poisson distribution.

Baseline functions:

@toast682 Throw an error when trying to evaluate pmf of a non-discrete random variable, and density of non-continuous. Use the variable() function. This should be for the eval_pmf.dst() and eval_density.dst() methods.
@vincenzocoia Fix the quantile function algorithm (as computed from the cdf):
- First, check where the solution lies with respect to discontinuities.
- Second, if the inverse does not fall directly on a discontinuity, implement a Newton-Raphson algorithm to find the inverse.

Graft distributions:

Need to reprogram change their definition via graft_left() and graft_right().
Then, need to change how .graft() methods operate (or ensure the existing ones work).

More insight about graft_left() and graft_right():

Perhaps class should be: c("graft", "dst"). The object should store (1) the two distributions, (2) the cutoff point at which the distribution will be grafted, and (3) some indication as to whether a distribution has been grafted to the left or the right (or perhaps allow for both possibilities to be stored). As with mix(), grafting two finite distributions should result in a finite distribution. But, unlike mix(), if a graft distribution is further grafted, there is no simplification that can happen around the grafting of the original distributions.

What is a graft distribution?

Imagine taking a cdf, and cutting off its right tail (after some specified point). Replace the cut-out section with another cdf (cdf2), just rescaling cdf2 vertically so that it connects with the original cdf. (We only use cdf2 beyond the cutoff point). This is a right-graft; similar concept applies to a left-graft. Useful for extreme value models.

range of sliced distributions not accurate

Slicing in between two discrete points (and where the cdf is flat) is inaccurate, for both slice_left() and slice_right() distributions. In the following example (of finding the range of a slice_left() distribution), the range should be c(4.5, 10):

library(distplyr)
library(magrittr)
dst_empirical(1:10) %>% 
    slice_left(4.5) %>% 
    range()
#> [1]  5 10

^{Created on 2021-10-16 by the reprex package (v2.0.1)}

EDIT: The above output is actually correct; the issue was if the CDF ramped up after the flat part. Fixed now.

Quantile algorithm

When the quantile function isn't available, it needs to be evaluated by inverting the cdf. This requires an algorithm.

I've tried using uniroot() in other contexts, and anecdotally, I find that it's not as reliable as writing our own Newton Raphson algorithm.
The current algorithm can be found in the helpers-quantile.R file, but it doesn't work: it relies on a deprecated (and removed) function called discontinuities(). In place of discontinuities(), we now have next_discrete() and prev_discrete(), which finds the next/previous discrete value in a distribution relative to some reference point. These will need to be used in the quantile algorithm.
An example of an algorithm can be found in the igcop package (the version prior to incorporating C++ code), in the function igl_gen_inv_algo(). It is the inverse of a function called igl_gen() (which is a decreasing function starting at y=1 when x=0, to y=0 as x->infinity -- so, it's a survival function, which makes it relevant to our situation).

Three additional complications/additions that arise in our case compared to the igcop case above:

Sometimes there are step discontinuities in the cdf. We should be able to check whether the quantile level occurs within a step discontinuity. So, I imagine two overall steps to evaluating the quantile: (1) first checking whether the quantile level falls within a discontinuity; if so, return that x value. (2) if it does not fall in a discontinuity, then run the Newton Raphson algorithm.
We must anticipate flat spots in the cdf. Newton Raphson will return any value along the flat spot, but we want the leftmost value. The current algorithm in the helpers-quantile.R file does have a method that deals with this -- probably best to use this approach.
It's not obvious what a starting value should be. Maybe we can start with the mean -- somewhere central to the distribution. But will this always be available? If not, perhaps a call to next_discrete(). Basically, we need to know some x value that's "within" the distribution.

Double graft: environment issue

Something bizarre happens when you try to graft both tails, one following the other. Here's a reprex to debug, although it requires adding a line -- print(breakpoint) -- at the top of one of the default slice methods. In this case, I've added it to slice_right.dst().

library(distplyr)
#> ℹ Loading distplyr
d <- dst_norm(0, 1)
dl <- d %>% graft_left(d, breakpoint = -1)
#> [1] -1
dlr <- dl %>% graft_right(d, breakpoint = 1)
#> [1] -1
plot(dlr, "cdf", from = -3, to = 3)

invisible(dl %>% slice_right(breakpoint = 1))
#> [1] -1
invisible(dl %>% slice_right.dst(breakpoint = 1))
#> [1] 1
dl$components
#> $distributions
#> $distributions[[1]]
#> slice_right dst
#> 
#>  distribution :
#> norm parametric dst
#> 
#>  parameters :
#> $mean
#> [1] 0
#> 
#> $variance
#> [1] 1
#> 
#> 
#> $distributions[[2]]
#> slice_left dst
#> 
#>  distribution :
#> norm parametric dst
#> 
#>  parameters :
#> $mean
#> [1] 0
#> 
#> $variance
#> [1] 1
#> 
#> 
#> 
#> $probs
#> [1] 0.1586553 0.8413447
#> 
#> $breakpoint
#> [1] -1

^{Created on 2021-08-30 by the reprex package (v0.3.0)}

The issue is pinpointed in the two lines with invisible: calling the method directly yields the appropriate breakpoint value, whereas going through the generic yields the breakpoint from the previous graft. Somehow, the scoping rules are favouring the breakpoint entry that's being stored in the dl graft distribution (see the output of the last line), as opposed to the one directly input as a function argument.

Bug with Pareto distribution and eval quantile

library(tidyverse)
library(distplyr)
dst_pareto <- local({
  ppareto <- function(q, xi, lower.tail = TRUE) {
    res <- pmin(1, q ^ (-1 / xi))
    if (lower.tail) res <- 1 - res
    res
  }
  dpareto <- function(q, xi) sapply(q, function(q_) if (q_ >= 1) q_ ^ (-1 / xi - 1) / xi else 0)
  qpareto <- function(p, xi) (1 - p) ^ (-xi)
  function(xi) dst_parametric("pareto", xi = xi, .variable = "continuous")
})
eval_quantile(maximise(dst_pareto(1), dst_pareto(2) + 5), at = 0.5)

Error in while (cdf_gt(cdf_left, p_min)) { :
missing value where TRUE/FALSE needed

Mixed variables should have more info stored

If the random variable of a distribution is mixed, it would be useful to indicate "where" it is continuous, and "where" it is discrete. This would at least be useful for a graft distribution, because it's unclear what variable type results when grafting a distribution of a mixed variable.

EVI should distinguish between left and right tail

Need to modify the get_evi generic and methods to distinguish between the extreme value index of the left tail and the right tail.

Package bifurcation

distplyr, in its current form, is too big. I suggest separating out a distionary package for defining base distribution forms (parametric, finite, and others), and leaving distplyr for distribution manipulation.

The bifurcation would be similar to the tibble & dplyr packages:

the tibble package defines tibbles, and dplyr manipulates tibbles. Similarly,
the distionary package defines distributions, and the distplyr package manipulates distributions.

Why the name "distionary"? Because the package will be like a dictionary of specific distributions.

The distionary package will be similar to the distributional package, with key differences:

distributional provides a vectorized framework for distributions. distionary prefers leaving vectorization up to the user, because evaluating a distribution can result in different types of output (from a tibble to a single character vector to a numeric vector).
distributional has some multivariate distribution families. distionary distributions will only be univariate, eventually delegating multivariate distributions to another class to be evaluated in different ways.
distionary allows for users to create their own parametric distribution.
distionary will eventually allow for the creation of a distribution from a cumulative hazard function from a proportional hazards model, and a "cumulative odds function" (if I can call it that) from a proportional odds model.

pmf of mixture model

Evaluating the pmf of a mixture model results in an incorrect evaluation, and an incorrect output length. Here's a reprex of mixing two Poisson distributions, but the same issue occurs with other mixtures.

library(distplyr)
library(magrittr)
mix(dst_pois(1), dst_pois(2)) %>% 
    eval_pmf(at = 1:10)
#> [1] 0

^{Created on 2021-09-14 by the reprex package (v0.3.0)}

Error evaluating `invert()`

library(distplyr)
#> Loading required package: distionary
d <- dst_gamma(1, 1)
d2 <- invert(d)
eval_survival(d2, at = 2)
#> Error in eval_pmf.parametric(dist, at = 1/at): Distribution is of variable type 'continuous'; pmf only exists for discrete variables. Perhaps you'd like to evaluate outside of strict mode?

^{Created on 2022-04-01 by the reprex package (v2.0.1)}

Slicing off a discrete point does not result in continuous distribution

library(distplyr)
#> Loading required package: distionary
mix(dst_exp(1), dst_degenerate(0)) %>% 
    slice_left(0, include = TRUE) %>% 
    plot()
#> Warning in plot.dst(.): Density function does not exist. Plotting cdf instead.
#> Error in eval_density(d, at = breakpoint, strict = FALSE): unused argument (strict = FALSE)

The distribution should be Exp(1), and is continuous.

Problems with new, undeveloped "parametric" class

Notes about an error from @yelselmiao :

d1 <- dst_empirical(1:10)
d2 <- dst_unif(5, 10)
d3 <- dst_unif(0, 5)

plot(d1)
plot(d2, add = TRUE, col= 'red')
plot(d3, add = TRUE, col = 'blue')


g1 <- graft_right(d1, d2, breakpoint = 5, include = FALSE)
g1$components


# breakpoint retained
g2 <- graft_right(d1, d2, breakpoint = 5, include = TRUE)
g2$components

# cannot use plot for g1 and g2
eval_cdf(g1, at = 5)
eval_cdf(g2, at = 5) # it's NULL here, haven't figured why

#enframe_cdf(g1, at = seq(0, 10, 1)) 



# Error in get(fun, mode = "function", envir = envir) : object 'p' of mode 'function' was not found
g3 <- graft_left(d3, d1, breakpoint = 5, include = FALSE)

g4 <- graft_left(d3, d1, breakpoint = 5, include = TRUE)

Also, with the oncoming new parametric class, parametric distribution like Unif(0, 1) are not evaluating properly. For example:

eval_survival(dst_unif(0, 1), at = 0.4)

throws an error.

Temporary fix: make a temporary function until the "parametric" class is settled:

eval_survival.unif <- function(distribution, at) {
     1 - punif(at, min = distribution$parameters$min, max = distribution$parameters$max)
}

Issue with Slice Function, included point not being evaluated correctly when checking pmf

library(distplyr)
library(testthat)
d1 <- dst_norm(0, 1)
d2 <- dst_empirical(0:4)
d <- mix(d1, d2)
dl_not <- slice_left(d, 0, include = FALSE)
dr_not <- slice_right(d, 0, include = FALSE)
expect_equal(eval_pmf(dl_not, at = 0, strict = FALSE), 0.1)
#> Error: eval_pmf(dl_not, at = 0, strict = FALSE) not equal to 0.1.
#> 1/1 mismatches
#> [1] 0 - 0.1 == -0.1
expect_equal(eval_pmf(dr_not, at = 0, strict = FALSE), 1 / 6)
#> Error: eval_pmf(dr_not, at = 0, strict = FALSE) not equal to 1/6.
#> 1/1 mismatches
#> [1] 0 - 0.167 == -0.167