jorenham / lmo Goto Github PK

Trimmed L-moments and L-comoments for robust statistics.

Home Page: https://jorenham.github.io/Lmo/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

l-moments numpy python statistics robust-optimization robust-statistics fitting-distribution multivariate-statistics probability-statistics l-comoments

lmo's Introduction

Lmo - Trimmed L-moments and L-comoments

Unlike the legacy product-moments, the L-moments uniquely describe a probability distribution, and are more robust and efficient.

The "L" stands for Linear; it is a linear combination of order statistics. So Lmo is as fast as sorting your samples (in terms of time-complexity).

Key Features

Calculates trimmed L-moments and L-comoments, from samples or any scipy.stats distribution.
Full support for trimmed L-moment (TL-moments), e.g. lmo.l_moment(..., trim=(1/137, 3.1416)).
Generalized Method of L-moments: robust distribution fitting that beats MLE.
Fast estimation of L-comoment matrices from your multidimensional data or multivariate distribution.
Goodness-of-fit test, using L-moment or L-moment ratio's.
Exact (co)variance structure of the sample- and population L-moments.
Theoretical & empirical influence functions of L-moments & L-ratio's.
Complete docs, including detailed API reference with usage examples and with mathematical $\TeX$ definitions.
Clean Pythonic syntax for ease of use.
Vectorized functions for very fast fitting.
Fully typed, tested, and tickled.
Optional Pandas integration.

Quick example

Even if your data is pathological like Cauchy, and the L-moments are not defined, the trimmed L-moments (TL-moments) can be used instead. Let's calculate the TL-location and TL-scale of a small amount of samples:

>>> import numpy as np
>>> import lmo
>>> rng = np.random.default_rng(1980)
>>> x = rng.standard_cauchy(96)  # pickle me, Lmo
>>> lmo.l_moment(x, [1, 2], trim=(1, 1)).
array([-0.17937038,  0.68287665])

Now compare with the theoretical standard Cauchy TL-moments:

>>> from scipy.stats import cauchy
>>> cauchy.l_moment([1, 2], trim=(1, 1))
array([0.        , 0.69782723])

See the documentation for more examples and the API reference.

Roadmap

Automatic trim-length selection.
Plotting utilities (deps optional), e.g. for L-moment ratio diagrams.

Installation

Lmo is on PyPI, so you can do something like:

pip install lmo

Required dependencies

These are automatically installed by your package manager, alongside lmo.

Package	Minimum version
Python	`3.10`
NumPy	See NEP 29
SciPy	`1.9`

Optional dependencies

Package	Minimum version	Notes
Pandas	`1.4`	Installable as `lmo[pandas]`

Foundational Literature

lmo's People

Contributors

Stargazers

Watchers

Forkers

wolph

lmo's Issues

Scalar trim alias missing in `typing.AnyTrim`

Empirical influence functions

Analogous to the lmo.theoretical influence functions from #22, add the following methods:

lmo.l_moment_influence(x: array_like[number], r: int, *, **) -> (T) -> T
lmo.l_ratio_influence(x: array_like[number], r: int, k: int = 2, *, **) -> (T) -> T

Option to skip sorting with `lmo.l_{}(..., sorted=True)`

This can sneakily be exploited to use e.g. lmo.l_moment for the calculation of pairwise L-comoments with e.g. lmo.l_moment(x[np.argsort(y)], ..., sorted=True).

Optional xarray support

See https://docs.xarray.dev/en/stable/internals/extending-xarray.html

Support multivariate `scipy.stats` continuous random vectors

Extend the scipy.stats joint/multivariate distributions with L-comoment (ratio) methods, using lmo.theoretical.l_comoment_from_pdf and lmo.theoretical.l_coratio_from_pdf.

This requires the joint PDF, and the marginal CDF's.
Unfortunately, the marginals are nowhere to be found in scipy.stats._multivariate.multi_rv_generic or its subtypes.

So the only way to implement this feature, is by figuring out the marginals of each joint-distribution, and manually add the l_co(moment|ratio|stats|loc|scale|rr|skew|kurtosis) methods to their respective scipy.stats._multivariate.{}_(gen|frozen)
types.

multivariate_normal (norm marginals)
dirichlet (beta marginals)
multivariate_t (t marginals)

The 13 other ones are either discrete, matrix-valued, directional, or have no .pdf() method, and therefore out-of-scope.

Population (T)L-moments from CDF and PPF

A `nan_policy` kwarg for the sample estimators

Currently, data with nan's are either always propagated, or an error is raised (e.g. np.asarray_chkfinite raises a ValueError).
The behaviour around nan's is currently unspecified and untested.

Luckily, SciPy has their very well-specified nan_policy (https://scipy.github.io/devdocs/dev/api-dev/nan_policy.html) which appears to be very much applicable to (at least) Lmo's sample estimators.

`scipy.stats.distributions` integration

add methods l_moment, l_stats, etc. to the "generic" and "frozen" distn bases (monkeypatch)
- rv_continuous (depends #6), and use #5 + the name attribute to check if specific theoretical L-moments are known
- rv_discrete (depends #9)
for the theoretically-known distributions from #5, override l_moment on the specific rv_continuous instances

Fractional trimming in sample L-moment

Modify lmo.l_weights to accept trim: tuple[float, float], by replacing the succession-matrix algorithm with a direct approach, so that gamma functions can be used (instead of comb).

This is already possible for population L-moments.

Implement the 4-parameter generalized lambda distribution

https://wikipedia.org/wiki/Tukey_lambda_distribution
https://reference.wolfram.com/language/ref/TukeyLambdaDistribution.html

Support `__array_ufunc__` in `lmo.l_(loc|scale|variation|skew|kurtosis)`

Currently, these always return a np.ndarray, even when the input is an e.g. pd.Series.

By making these functions aware of the potential __array_ufunc__ or __array_function__ methods, it can automatically "lift" itself, so that the types of ufunc-aware instances can pass-though.

It's probably best to use np.frompyfunc for this.

Note that this will require additional @overload's.

improve the `import lmo` runtime

Most of the startup time comes from numpy and scipy imports.

In some cases scipy imports can be replaced with stdlib math functions.
In other cases it sometimes to make sense to move the top-level imports into the relevant functions.

Implement the Wakeby distribution

https://wikipedia.org/wiki/Wakeby_distribution

Implement the Pareto type IV distribution

Allow passing `scipy.stats` univariate distributions to the `lmo.l_*` functions

The current monkey-patched methods have several issues:

type annotations are impossible for extensions methods, and I've seen no attempts at a PEP that plans to add the required functionality
relies on undocumented scipy.stats.rv_* internals, which can break without warning
not guaranteed to work when using multiprocessing (it might be fine when using fork, but with e.g. spawn or joblib+loky I suspect it could break, although I have yet to test this).
difficult to document properly within mkdocs (hacks required), especially when considering that the docstring styles of scipy and lmo are incompatible.
negatively affects the performance of import lmo (the first time)

Instead, it'd be better to extend the lmo.l_* moment functions, making them accept the rv_ instances directly.

implement for discrete RV's
implement for continuous RV's
document the usage, and add (doctest) examples
write (hypothesis) tests for the discrete/continuous RV's in lmo.distributions and scipy.stats
use two sepearate TypedDict's for **kwargs for specifying extra options to the underlying sample- and population- L-moments estimators
add appropriate @overload's.
remove lmo.contrib.scipy_stats.l_rv_generic and the related monkey-patch machinery

Missing `@overload` for L-moment aliases

E.g. lmo.l_loc(x, axis=None) always returns a scalar T, but the type checker cannot narrow the return type any further than npt.NDArray[T] | T.

Optional polars support

Use the Polars extension API to register namespaces, analogous to those for Pandas

related to #184

Arbitrary concomitants in the prescence of duplicates

This can be solved with either np.lexsort or np.argsort + structured type's. So a small performance test is requires to select the best of the two.

Alternatively, a post-sort check can quickly be done to identify (consecutive) duplicate values. These sub-array's can then individually be re-ordered, using the concomitant's ordering. But I suspect this to be a lot slower that the other two options.

Implement the known theoretical distribution L-moments

I.e. the L-moments that are listed here:
https://jorenham.github.io/Lmo/distributions/

Asymptotic covariance

Asymptotic covariance is defined as

$$\mathbb{AV}_F[T_i, T_j] = \mathbb{E}_F[\psi_{T_i | F}(X) \; \psi_{T_j | F}(X)]\; ,$$

with statistical functional $T_r$ either an L-moment or L-ratio (theoretical or sample estimate), $\psi_{T_j | F}(X)$ its influence function, and $F(x)$ the CDF or ECDF.

parametric
nonparametric

`scipy.stats.wald.l_moment` bug

>>> scipy.stats.wald.l_moment(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-bb1131db051c> in ?()
----> 1 scipy.stats.wald.l_moment(1)

~/.local/lib/python3.12/site-packages/lmo/contrib/scipy_stats.py in ?(self, r, trim, quad_opts, *args, **kwds)
    317                 # undefined mean -> distr is "pathological" (e.g. cauchy)
    318                 return np.full(_r.shape, np.nan)[()]
    319 
    320         # L-moments of the standard distribution (loc=0, scale=scale0)
--> 321         l0_r = self._l_moment(_r, *args, trim=_trim, quad_opts=quad_opts)
    322 
    323         # shift (by loc) and scale
    324         shift_r = loc * (_r == 1)

~/.local/lib/python3.12/site-packages/lmo/contrib/scipy_stats.py in ?(self, r, trim, quad_opts, *args)
    169             of specific distributions, `r` and `trim`.
    170 
    171         """
    172         cdf, ppf = self._get_xxf(*args)
--> 173         lmbda_r = l_moment_from_cdf(
    174             cdf,
    175             r,
    176             trim=trim,

~/.local/lib/python3.12/site-packages/lmo/theoretical.py in ?(cdf, r, trim, support, quad_opts, alpha, ppf)
    333             * eval_sh_jacobi(_r - 2, t + 1, s + 1, p)
    334         )
    335 
    336     a, d = support or _tighten_cdf_support(cdf, support)
--> 337     b, c = (ppf(alpha), ppf(1 - alpha)) if ppf else (a, d)
    338 
    339     loc0 = a if np.isfinite(a) and a > 0 else 0
    340 

~/.local/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py in ?(self, x)
  10207     def _ppf(self, x):
> 10208         return invgauss._ppf(x, 1.0)

~/.local/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py in ?(self, x, mu)
   4537         with np.errstate(divide='ignore', over='ignore', invalid='ignore'):
   4538             x, mu = np.broadcast_arrays(x, mu)
   4539             ppf = _boost._invgauss_ppf(x, mu, 1)
   4540             i_wt = x > 0.5  # "wrong tail" - sometimes too inaccurate
-> 4541             ppf[i_wt] = _boost._invgauss_isf(1-x[i_wt], mu[i_wt], 1)
   4542             i_nan = np.isnan(ppf)
   4543             ppf[i_nan] = super()._ppf(x[i_nan], mu[i_nan])
   4544         return ppf

TypeError: 'numpy.float64' object does not support item assignment

Apparantly, scipy.stats.wald._ppf requires the input to be a >0d array-like (a scalar won't do). This isn't the case for most (or perhaps all) other continuous distributions.

Support for numpy 2.0

blocker: numpy/numpy#26199

update:
numpy/numpy#26199 has been fixed; now waiting for the next 2.0.0(.rc2)? release

Release on anaconda

Lmo heard some news about some spreadsheet software that he wants to play with.
But Lmo has to tame a big snake before he can do that.
Luckily, Lmo is best friends with another small snake already.

Relevant resources:

integer overflow in `linalg.sh_legendre` on windows

See: https://github.com/jorenham/Lmo/actions/runs/6687308385/job/18167801378

The np.int_ is a C long, which is 32bits in win64, because, well.. err... logic...?

... anyway, this is could be solved by replacing all np.int_'s by np.int64.

Another option could be to switch _sh_jacobi_f in the cases where _sh_jacobi_i will overflow given k and dtype in
sh_legendre, and similarly in sh_jacobi(k, a, b, dtype).

Note that in practise, this is only a problem in _lm._l_weights_pwm. The _lm._l_weights_ostat method is slower, slightly less precise, but a lot less likely to overflow.

For any users facing this issue; the workaround is to use e.g. trim=1e-15 instead of trim=0. This forces it to use the ostat weights, since the pwm weights can only handle integer trimming.

Variance structure of sample (T)L moments

$\mathbb{E}[l_{k_1}^{(s, t)}, l_{k_2}^{(s, t)}]$

For $s=t$ (L-moments and symmetric TL-moments), Elamir and Seheult (2003) found the exact solution.
For $s \neq t$, more research is required.

incorrect `diagnostic.l_ratio_bounds` for `has_variance=True`

The strict bounds should be divided by the L-scale, not by the upper bound of the L-scale. This results is bounds that are too tight.

E.g. the strict untrimmed L-moment bounds for $r \ge 3$ are $\sqrt{2r - 1} |\lambda_r| \le \sigma$, so the corresponding L-ratio bouds are $\sqrt{2r - 1} |\tau_r| \le \sigma / \lambda_2$.

Incorrect trimming of `inf` samples in TL-moment estimators

>>> a = np.array([0, 1, 2, 3, 4, 5, 6, np.inf])

>>> lmo.l_stats(a, trim=0)
array([inf, inf, nan, nan])

>>> lmo.l_stats(a, trim=(0, 1))
array([nan, nan, nan, nan])

>>> lmo.l_stats(a, trim=1)
array([nan, nan, nan, nan])

Only the trim=0 is correct, but in the other cases, it should be:

>>> lmo.l_stats(a, trim=(0, 1))
array([2.   , 1.125, 0.   , 0.   ])

>>> lmo.l_stats(a, trim=1)
array([3.5, 0.9, 0. , 0. ])

Fractional trimming in `lmo.l_moment_cov`

Currently, only integral trimming is supported. See Hosking (2015), and my "offline analogue notes", for implementation details.

`contrib.sympy`

Optional support for SymPy, e.g. utility functions for:

extend the sympy.stats distributions with symbolic L-moment methods, analogously to scipy.stats
finding closed-form solutions of theoretical
- order stats
- PWM's
- L-moments
- L-comoments
- L-moment (asymptotic) variance-covariance matrix
- IF's, BP's, etc.
- L-MGF's 😏
allow using sympy's numerical integration as an alternative to the current scipy.integrate.quad
support for using the interpolative summation method in l_poly (as alternative to the current naïve summation)
Hahn polynomials (use $_3F_2$; it's absent in scipy), which show up in the trimmed L-moment weights (Hosking, 2015).

A visual explanation of (T)L-moments

Plots of the weights of $l_k^{(s,t)}$ for $k \in \set{1,2,3,4} $, and $s, t \in \set{(0, 0), (1, 1), (0, 1), (1, 0),}$.

Test for normality

Implement the $\tau^2_{3,4}$ for normality from Harri & Coble .

See https://rdrr.io/cran/lmomco/src/R/tau34sq.normtest.R for a reference implementation in R.

Optional numba support

Add contrib.numba, and use a noop @jit if not available on relevant functions.

Additionally, the scipy.integrate.quad integrand can be sped up with cfunc, see https://numba.readthedocs.io/en/stable/user/cfunc.html#example

In places where scipy.special functions are used, some trickery is needed.
For example, this snippet is used to make scipy.special.erfi work within numba jitted functions for np.float64 input:

# lmo/contrib/numba.py
def _overload_scipy_special_erfi():
    _erfi_f8 = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)(
        numba.extending.get_cython_function_address(
            'scipy.special.cython_special',
            '__pyx_fuse_1erfi',
        ),
    )
    
    @numba.extending.overload(scipy.special.erfi)
    def numba_erfi(*args):
        match args:
            case (numba.types.Float(),):
                def _numba_erfi(*args):
                    return _erfi_f8(*args)
                return _numba_erfi
            case _:
                return None

def overload_scipy_special():
    _overload_scipy_special_erfi()

# lmo/pyproject.toml
[tool.poetry.plugins.numba_extensions]
init = "lmo.contrib.numba:overload_scipy_special"

Realistic usage example in the README.md

Illustrate that L-moments can be better choice than conventional moments

Method of L-moments for unknown (univariate) distributions

depends on #6
related to #5

Short explanation of L-moments in README.md

L-functionals and regression models

Hössjer & Karlsson (2023) describe the framework of L-functionals, which (almost) generalize the Legendre-based (untrimmed) L-moments. They additionally re-invent Hyowon An's Gaussian-centered "HL-moments", and dub them "Hermite L-functionals". Similarly, they introduce the (novel) Laguerre L-functionals with Exp(1) as reference distribution, which sound rather promising IMHO.

They further show how to apply these L-functionals for regression in (transformed) linear- and quantile-regression models (!), using a very flexible, yet practical approach.

Their use of conditional L-functionals might be interesting to attempt to "backport" into the familiar (Legendre- & Jacobi-based) L-moments.

For now, more research into these L-functionals is required before coming up with a concrete implementation plan.

Calculation of the L-moment ratio curves for specific families of distributions, e.g. as $\tau_r(\tau_k | \mathcal{F})$.
Non-parametric L-ratio curve estimation, either using repeated measurements, or some bootstrapping method
Plotting both parametric curves (lineplot), and optionally a non-parametric curve (scatterplot), simultaneously, including helpful legend- and axis labels. Preferrably, this should be backend-agnostic, with (for now) one backend implementation in contrib.matplotlib .

Some examples for inspiration:

M.C. Peel et al. (2001):
J.R.M. Hosking (2015):
J. Galeano-Brajones et al. (2023):