vincenzocoia / distplyr Goto Github PK
View Code? Open in Web Editor NEWDraw powerful insights using distributions with this R package.
Home Page: https://distplyr.netlify.app/
License: Other
Draw powerful insights using distributions with this R package.
Home Page: https://distplyr.netlify.app/
License: Other
One issue that distplyr hasn't considered yet is that of alternative parameterizations of a distribution.
For example, dst_norm()
accepts mean
and variance
as the parameters, but other parameterizations would be useful, such as specifying, say, the 0.05- and 0.95-quantiles (use case: "I'm pretty sure this expense will be between $100 and $200", just set these as lower and upper quantiles). Even mean
and stdev
would be useful.
I have two ideas for achieving this, but they shouldn't be implemented until after the package is submitted to CRAN.
Something like:
dst_norm() %>%
set_quantiles(0.05 ~ 100, 0.95 ~ 200)
This would operate rlang's tidy evaluation machinery, leaving unevaluated parameters as unknown variables (quosures, probably), until they are evaluated downstream. Bonus: this would also allow for scenarios like dst_unif(a, a + 1)
, where a
is unknown.
dst_
call.I'm thinking along the lines of ggplot2's aes()
function. The aes()
function "parameterizes" a plot according to some aesthetics (see ?geom_linerange
for an example of being able to parameterize in more than one way). Except having a special function like aes()
might not be useful here, because the parameters can be specified right in the dst_*()
call.
So, instead of:
dst_norm <- function(mean, variance) { ... }
we would have something like:
dst_norm <- function(mean, variance, ...)
akin to aes <- function(x, y, ...)
.
Calls would look something like:
dst_norm(0, 1) # For mean and variance
dst_norm(0, stdev = 5) # For mean and standard deviation
dst_norm(0.05 ~ 100, 0.95 ~ 200) # For the 0.05- and 0.95-quantiles.
It looks like the catchall .dst
method is being used for finite/empirical distributions for higher moments, and sometimes causes an error:
library(distplyr)
foo <- dst_empirical(c(1, 5, 100))
skewness(foo)
kurtosis_exc(foo)
There should be a special .finite
method for skewness and kurtosis.
@toast682 I'm documenting this here as something that's on our collective plate before v0.1.3 is launched. We'll tackle this together sometime soon -- it should only amount to coming up with a couple formulas.
The quantile function of a graft distribution still needs to be coded. Currently, if we try to evaluate quantiles of a graft, it will dispatch the default eval_quantile.dst()
method, and will fail:
library(distplyr)
set.seed(1)
a <- dst_norm(0, 1)
x <- realise(a, n = 50)
b <- dst_empirical(x)
g <- graft_right(b, a, breakpoint = 1)
eval_quantile(g, at = 1:9/10)
#> Error in discontinuities(object): could not find function "discontinuities"
Created on 2021-09-14 by the reprex package (v0.3.0)
It fails because the default eval_quantile.dst()
isn't quite set up yet, but even if it was set up, we wouldn't want it to be dispatched, because eval_quantile.dst()
needs to run an algorithm to invert the cdf to find the quantiles -- not efficient, especially when there's an easy way to calculate quantiles of a graft distribution.
If grafting cdf A(x) to the right of cdf B(x) at point x0 (think "B" for "base" distribution), the cdf of the resulting distribution is:
This is the formula if x0 is kept in the base distribution B, but not the tail distribution A. If x0 is included in A instead of B, then A(x0) would instead be A(x0^{-}) -- that is, the limit at x0 coming from the left -- and B(x0) would instead be B(x0^{-}).
To find the quantile function, just invert this.
The cdf when grafting on the opposite tail is similar: grafting cdf A(x) to the left of B(x) at point x0, the cdf of the resulting distribution is:
This is the formula if x0 is kept in the tail distribution A, but not the base distribution B. For the opposite, then as before, A(x0) becomes A(x0^{-}), and B(x0) becomes B(x0^{-}).
@toast682 discontinuities()
should accept a distribution, and return a data frame or tibble of all the points along the x-axis where the cdf has a jump continuity in one column, and the size of the discontinuity in the other.
I've deliberately been delaying creating a collection of parametric distributions to be made available in distplyr, so that package evolution/design could be more nimble, and in case we came up with an idea for automating such a thing. This issue elaborates on the latter: somehow automating their creation. We've discussed two ideas so far:
The idea is to make one or more R scripts per distribution that's made available through R (like geom
, gamma
, beta
, binom
, etc.), containing all the functions needed for each distribution.
Problem: requires brute force and lots more lines of code to test; does not address the issue of allowing a user to add their own parametric distribution.
Ideas for execution:
FILL_THIS_IN
everywhere that needs a manual substitution.I'm not digging this option.
The idea is that, if a user has r
, p
, q
, d
functions available for a distribution (say "unif"), then they would be able to access it as a distplyr distribution by doing something like:
d <- dst_parametric("unif", min = 0, max = 1)
It (or related functions) would then do the work of appending the appropriate prefix whenever the user calls representations like eval_cdf()
: perhaps eval_cdf.parametric()
is dispatched, which grabs the "unif", appends p
as a prefix, plugs in the parameters, and evaluates at the at
argument.
If we do have this functionality, we should use it as developers, too. It would cut down our code dramatically. For example, we could code up dst_unif
like so:
dst_unif <- function(min, max) dst_parametric("unif", min = min, max = max)
There's one problem remaining: there are simple formulas for quantities like the mean, variance, range, etc. We also need to specify if the distribution is continuous or discrete somehow. I'm thinking we can have functions like set_*()
to specify these. So, maybe something like this:
distplyr::dst_unif
does not exist):d <- dst_parametric("unif", min = 0, max = 1) %>%
set_mean((min + max) / 2) %>%
set_range(min, max)
...etc
distplyr::dst_unif()
:dst_unif <- function(min, max) {
dst_parametric("unif", min, max) %>%
set_mean((min + max) / 2) %>%
set_range(min, max)
}
One big problem with this is that these additional pieces of information would have to be stored with the distribution itself, and we're trying to keep the things stored by a distribution at a minimum. Perhaps storing these things is OK on the user-facing side, but it's not when it comes to the developer-facing side of things.
Where would this information be stored for developers including a distribution like "unif"? Currently, we're storing this info in a verbose way: as specific methods (like mean.unif()
), which have the overhead of needing to wrap the expression (like (min + max) / 2
) with a bunch of function verbiage. An alternative? Maybe a JSON file, or simply a text file, containing information about these quantities, which could then be read by the specific method: mean.parametric()
could look for the appropriate JSON/text file (for, say, "unif") and grab the relevant piece. If we go this route, we wouldn't use the set_*()
functions as developers. This "data file" might look something like:
mean: (min + max) / 2
range: min, max
...
This approach is sounding far more robust and trustworthy than Idea 1. It would be great if we could get something to work here.
It would be nice to see these functions added to distplyr
:
mix()
: construct mixture distributioninflate_zero()
: make a zero-inflated distribution; perhaps also the more generic inflate_at()
?condition()
: obtain a conditional distribution.left_connect()
, as per right_connect()
.The plot()
function fails for flipped distribution:
library(distplyr)
r <- dst_gpd(1, 1, 1)
l <- -r
plot(r)
try(plot(l))
#> Error : C stack usage 7971168 is too close to the limit
eval_cdf(l, at = -10:10)
#> [1] 0.1000000 0.1111111 0.1250000 0.1428571 0.1666667 0.2000000 0.2500000
#> [8] 0.3333333 0.5000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
#> [15] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
Created on 2021-08-30 by the reprex package (v0.3.0)
Slicing a distribution on a flat part, so that range()
calculation is delegated to eval_quantile()
, results in the 0-quantile being incorrect, as below. This is because it's inverting the original CDF to the left of the flat part, no longer to the right (which we do only for the 0-quantile)
library(distplyr)
d <- mix(dst_unif(0, 1), dst_empirical(c(-1, 0.5, 2)))
plot(d, "cdf", from = -2, to = 3, n = 1000)
slice_left(d, 1.5) %>%
eval_quantile(0)
#> [1] 1
Created on 2023-02-05 with reprex v2.0.2
@toast682 I refined the to-do list we were looking at today. Here's what's relevant:
pois
instead of poisson
to match with base R.variable()
function. This should be for the eval_pmf.dst()
and eval_density.dst()
methods.graft_left()
and graft_right()
..graft()
methods operate (or ensure the existing ones work).More insight about graft_left()
and graft_right()
:
Perhaps class should be: c("graft", "dst")
. The object should store (1) the two distributions, (2) the cutoff point at which the distribution will be grafted, and (3) some indication as to whether a distribution has been grafted to the left or the right (or perhaps allow for both possibilities to be stored). As with mix()
, grafting two finite distributions should result in a finite distribution. But, unlike mix()
, if a graft distribution is further grafted, there is no simplification that can happen around the grafting of the original distributions.
What is a graft distribution?
Imagine taking a cdf, and cutting off its right tail (after some specified point). Replace the cut-out section with another cdf (cdf2), just rescaling cdf2 vertically so that it connects with the original cdf. (We only use cdf2 beyond the cutoff point). This is a right-graft; similar concept applies to a left-graft. Useful for extreme value models.
Slicing in between two discrete points (and where the cdf is flat) is inaccurate, for both slice_left()
and slice_right()
distributions. In the following example (of finding the range of a slice_left()
distribution), the range should be c(4.5, 10)
:
library(distplyr)
library(magrittr)
dst_empirical(1:10) %>%
slice_left(4.5) %>%
range()
#> [1] 5 10
Created on 2021-10-16 by the reprex package (v2.0.1)
EDIT: The above output is actually correct; the issue was if the CDF ramped up after the flat part. Fixed now.
When the quantile function isn't available, it needs to be evaluated by inverting the cdf. This requires an algorithm.
uniroot()
in other contexts, and anecdotally, I find that it's not as reliable as writing our own Newton Raphson algorithm.discontinuities()
. In place of discontinuities()
, we now have next_discrete()
and prev_discrete()
, which finds the next/previous discrete value in a distribution relative to some reference point. These will need to be used in the quantile algorithm.igl_gen_inv_algo()
. It is the inverse of a function called igl_gen()
(which is a decreasing function starting at y=1 when x=0, to y=0 as x->infinity -- so, it's a survival function, which makes it relevant to our situation).Three additional complications/additions that arise in our case compared to the igcop case above:
helpers-quantile.R
file does have a method that deals with this -- probably best to use this approach.next_discrete()
. Basically, we need to know some x value that's "within" the distribution.Something bizarre happens when you try to graft both tails, one following the other. Here's a reprex to debug, although it requires adding a line -- print(breakpoint)
-- at the top of one of the default slice methods. In this case, I've added it to slice_right.dst()
.
library(distplyr)
#> ℹ Loading distplyr
d <- dst_norm(0, 1)
dl <- d %>% graft_left(d, breakpoint = -1)
#> [1] -1
dlr <- dl %>% graft_right(d, breakpoint = 1)
#> [1] -1
plot(dlr, "cdf", from = -3, to = 3)
invisible(dl %>% slice_right(breakpoint = 1))
#> [1] -1
invisible(dl %>% slice_right.dst(breakpoint = 1))
#> [1] 1
dl$components
#> $distributions
#> $distributions[[1]]
#> slice_right dst
#>
#> distribution :
#> norm parametric dst
#>
#> parameters :
#> $mean
#> [1] 0
#>
#> $variance
#> [1] 1
#>
#>
#> $distributions[[2]]
#> slice_left dst
#>
#> distribution :
#> norm parametric dst
#>
#> parameters :
#> $mean
#> [1] 0
#>
#> $variance
#> [1] 1
#>
#>
#>
#> $probs
#> [1] 0.1586553 0.8413447
#>
#> $breakpoint
#> [1] -1
Created on 2021-08-30 by the reprex package (v0.3.0)
The issue is pinpointed in the two lines with invisible
: calling the method directly yields the appropriate breakpoint value, whereas going through the generic yields the breakpoint from the previous graft. Somehow, the scoping rules are favouring the breakpoint
entry that's being stored in the dl
graft distribution (see the output of the last line), as opposed to the one directly input as a function argument.
library(tidyverse)
library(distplyr)
dst_pareto <- local({
ppareto <- function(q, xi, lower.tail = TRUE) {
res <- pmin(1, q ^ (-1 / xi))
if (lower.tail) res <- 1 - res
res
}
dpareto <- function(q, xi) sapply(q, function(q_) if (q_ >= 1) q_ ^ (-1 / xi - 1) / xi else 0)
qpareto <- function(p, xi) (1 - p) ^ (-xi)
function(xi) dst_parametric("pareto", xi = xi, .variable = "continuous")
})
eval_quantile(maximise(dst_pareto(1), dst_pareto(2) + 5), at = 0.5)
Error in while (cdf_gt(cdf_left, p_min)) { :
missing value where TRUE/FALSE needed
If the random variable of a distribution is mixed, it would be useful to indicate "where" it is continuous, and "where" it is discrete. This would at least be useful for a graft distribution, because it's unclear what variable type results when grafting a distribution of a mixed variable.
Need to modify the get_evi
generic and methods to distinguish between the extreme value index of the left tail and the right tail.
distplyr, in its current form, is too big. I suggest separating out a distionary
package for defining base distribution forms (parametric, finite, and others), and leaving distplyr for distribution manipulation.
The bifurcation would be similar to the tibble & dplyr packages:
Why the name "distionary"? Because the package will be like a dictionary of specific distributions.
The distionary package will be similar to the distributional package, with key differences:
Evaluating the pmf of a mixture model results in an incorrect evaluation, and an incorrect output length. Here's a reprex of mixing two Poisson distributions, but the same issue occurs with other mixtures.
library(distplyr)
library(magrittr)
mix(dst_pois(1), dst_pois(2)) %>%
eval_pmf(at = 1:10)
#> [1] 0
Created on 2021-09-14 by the reprex package (v0.3.0)
library(distplyr)
#> Loading required package: distionary
d <- dst_gamma(1, 1)
d2 <- invert(d)
eval_survival(d2, at = 2)
#> Error in eval_pmf.parametric(dist, at = 1/at): Distribution is of variable type 'continuous'; pmf only exists for discrete variables. Perhaps you'd like to evaluate outside of strict mode?
Created on 2022-04-01 by the reprex package (v2.0.1)
library(distplyr)
#> Loading required package: distionary
mix(dst_exp(1), dst_degenerate(0)) %>%
slice_left(0, include = TRUE) %>%
plot()
#> Warning in plot.dst(.): Density function does not exist. Plotting cdf instead.
#> Error in eval_density(d, at = breakpoint, strict = FALSE): unused argument (strict = FALSE)
The distribution should be Exp(1), and is continuous.
Notes about an error from @yelselmiao :
d1 <- dst_empirical(1:10)
d2 <- dst_unif(5, 10)
d3 <- dst_unif(0, 5)
plot(d1)
plot(d2, add = TRUE, col= 'red')
plot(d3, add = TRUE, col = 'blue')
g1 <- graft_right(d1, d2, breakpoint = 5, include = FALSE)
g1$components
# breakpoint retained
g2 <- graft_right(d1, d2, breakpoint = 5, include = TRUE)
g2$components
# cannot use plot for g1 and g2
eval_cdf(g1, at = 5)
eval_cdf(g2, at = 5) # it's NULL here, haven't figured why
#enframe_cdf(g1, at = seq(0, 10, 1))
# Error in get(fun, mode = "function", envir = envir) : object 'p' of mode 'function' was not found
g3 <- graft_left(d3, d1, breakpoint = 5, include = FALSE)
g4 <- graft_left(d3, d1, breakpoint = 5, include = TRUE)
Also, with the oncoming new parametric
class, parametric distribution like Unif(0, 1) are not evaluating properly. For example:
eval_survival(dst_unif(0, 1), at = 0.4)
throws an error.
Temporary fix: make a temporary function until the "parametric" class is settled:
eval_survival.unif <- function(distribution, at) {
1 - punif(at, min = distribution$parameters$min, max = distribution$parameters$max)
}
library(distplyr)
library(testthat)
d1 <- dst_norm(0, 1)
d2 <- dst_empirical(0:4)
d <- mix(d1, d2)
dl_not <- slice_left(d, 0, include = FALSE)
dr_not <- slice_right(d, 0, include = FALSE)
expect_equal(eval_pmf(dl_not, at = 0, strict = FALSE), 0.1)
#> Error: eval_pmf(dl_not, at = 0, strict = FALSE) not equal to 0.1.
#> 1/1 mismatches
#> [1] 0 - 0.1 == -0.1
expect_equal(eval_pmf(dr_not, at = 0, strict = FALSE), 1 / 6)
#> Error: eval_pmf(dr_not, at = 0, strict = FALSE) not equal to 1/6.
#> 1/1 mismatches
#> [1] 0 - 0.167 == -0.167
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.