Coder Social home page Coder Social logo

jorenham / lmo Goto Github PK

View Code? Open in Web Editor NEW
3.0 3.0 1.0 3.83 MB

Trimmed L-moments and L-comoments for robust statistics.

Home Page: https://jorenham.github.io/Lmo/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
l-moments numpy python statistics robust-optimization robust-statistics fitting-distribution multivariate-statistics probability-statistics l-comoments

lmo's Introduction

Lmo - Trimmed L-moments and L-comoments

GitHub Workflow Status license Lmo - PyPI Lmo - Versions Ruff Pyright

Unlike the legacy product-moments, the L-moments uniquely describe a probability distribution, and are more robust and efficient.

The "L" stands for Linear; it is a linear combination of order statistics. So Lmo is as fast as sorting your samples (in terms of time-complexity).

Key Features

  • Calculates trimmed L-moments and L-comoments, from samples or any scipy.stats distribution.
  • Full support for trimmed L-moment (TL-moments), e.g. lmo.l_moment(..., trim=(1/137, 3.1416)).
  • Generalized Method of L-moments: robust distribution fitting that beats MLE.
  • Fast estimation of L-comoment matrices from your multidimensional data or multivariate distribution.
  • Goodness-of-fit test, using L-moment or L-moment ratio's.
  • Exact (co)variance structure of the sample- and population L-moments.
  • Theoretical & empirical influence functions of L-moments & L-ratio's.
  • Complete docs, including detailed API reference with usage examples and with mathematical $\TeX$ definitions.
  • Clean Pythonic syntax for ease of use.
  • Vectorized functions for very fast fitting.
  • Fully typed, tested, and tickled.
  • Optional Pandas integration.

Quick example

Even if your data is pathological like Cauchy, and the L-moments are not defined, the trimmed L-moments (TL-moments) can be used instead. Let's calculate the TL-location and TL-scale of a small amount of samples:

>>> import numpy as np
>>> import lmo
>>> rng = np.random.default_rng(1980)
>>> x = rng.standard_cauchy(96)  # pickle me, Lmo
>>> lmo.l_moment(x, [1, 2], trim=(1, 1)).
array([-0.17937038,  0.68287665])

Now compare with the theoretical standard Cauchy TL-moments:

>>> from scipy.stats import cauchy
>>> cauchy.l_moment([1, 2], trim=(1, 1))
array([0.        , 0.69782723])

See the documentation for more examples and the API reference.

Roadmap

  • Automatic trim-length selection.
  • Plotting utilities (deps optional), e.g. for L-moment ratio diagrams.

Installation

Lmo is on PyPI, so you can do something like:

pip install lmo

Required dependencies

These are automatically installed by your package manager, alongside lmo.

Package Minimum version
Python 3.10
NumPy See NEP 29
SciPy 1.9

Optional dependencies

Package Minimum version Notes
Pandas 1.4 Installable as lmo[pandas]

Foundational Literature

lmo's People

Contributors

dependabot[bot] avatar jorenham avatar wolph avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

wolph

lmo's Issues

Empirical influence functions

Analogous to the lmo.theoretical influence functions from #22, add the following methods:

  • lmo.l_moment_influence(x: array_like[number], r: int, *, **) -> (T) -> T
  • lmo.l_ratio_influence(x: array_like[number], r: int, k: int = 2, *, **) -> (T) -> T

Support multivariate `scipy.stats` continuous random vectors

Extend the scipy.stats joint/multivariate distributions with L-comoment (ratio) methods, using lmo.theoretical.l_comoment_from_pdf and lmo.theoretical.l_coratio_from_pdf.

This requires the joint PDF, and the marginal CDF's.
Unfortunately, the marginals are nowhere to be found in scipy.stats._multivariate.multi_rv_generic or its subtypes.

So the only way to implement this feature, is by figuring out the marginals of each joint-distribution, and manually add the l_co(moment|ratio|stats|loc|scale|rr|skew|kurtosis) methods to their respective scipy.stats._multivariate.{}_(gen|frozen)
types.

  • multivariate_normal (norm marginals)
  • dirichlet (beta marginals)
  • multivariate_t (t marginals)

The 13 other ones are either discrete, matrix-valued, directional, or have no .pdf() method, and therefore out-of-scope.

`scipy.stats.distributions` integration

  • add methods l_moment, l_stats, etc. to the "generic" and "frozen" distn bases (monkeypatch)
    • rv_continuous (depends #6), and use #5 + the name attribute to check if specific theoretical L-moments are known
    • rv_discrete (depends #9)
  • for the theoretically-known distributions from #5, override l_moment on the specific rv_continuous instances

Fractional trimming in sample L-moment

Modify lmo.l_weights to accept trim: tuple[float, float], by replacing the succession-matrix algorithm with a direct approach, so that gamma functions can be used (instead of comb).

This is already possible for population L-moments.

Support `__array_ufunc__` in `lmo.l_(loc|scale|variation|skew|kurtosis)`

Currently, these always return a np.ndarray, even when the input is an e.g. pd.Series.

By making these functions aware of the potential __array_ufunc__ or __array_function__ methods, it can automatically "lift" itself, so that the types of ufunc-aware instances can pass-though.

It's probably best to use np.frompyfunc for this.

Note that this will require additional @overload's.

improve the `import lmo` runtime

Most of the startup time comes from numpy and scipy imports.

In some cases scipy imports can be replaced with stdlib math functions.
In other cases it sometimes to make sense to move the top-level imports into the relevant functions.

Allow passing `scipy.stats` univariate distributions to the `lmo.l_*` functions

The current monkey-patched methods have several issues:

  • type annotations are impossible for extensions methods, and I've seen no attempts at a PEP that plans to add the required functionality
  • relies on undocumented scipy.stats.rv_* internals, which can break without warning
  • not guaranteed to work when using multiprocessing (it might be fine when using fork, but with e.g. spawn or joblib+loky I suspect it could break, although I have yet to test this).
  • difficult to document properly within mkdocs (hacks required), especially when considering that the docstring styles of scipy and lmo are incompatible.
  • negatively affects the performance of import lmo (the first time)

Instead, it'd be better to extend the lmo.l_* moment functions, making them accept the rv_ instances directly.

  • implement for discrete RV's
  • implement for continuous RV's
  • document the usage, and add (doctest) examples
  • write (hypothesis) tests for the discrete/continuous RV's in lmo.distributions and scipy.stats
  • use two sepearate TypedDict's for **kwargs for specifying extra options to the underlying sample- and population- L-moments estimators
  • add appropriate @overload's.
  • remove lmo.contrib.scipy_stats.l_rv_generic and the related monkey-patch machinery

Arbitrary concomitants in the prescence of duplicates

This can be solved with either np.lexsort or np.argsort + structured type's. So a small performance test is requires to select the best of the two.

Alternatively, a post-sort check can quickly be done to identify (consecutive) duplicate values. These sub-array's can then individually be re-ordered, using the concomitant's ordering. But I suspect this to be a lot slower that the other two options.

Asymptotic covariance

Asymptotic covariance is defined as

$$\mathbb{AV}_F[T_i, T_j] = \mathbb{E}_F[\psi_{T_i | F}(X) \; \psi_{T_j | F}(X)]\; ,$$

with statistical functional $T_r$ either an L-moment or L-ratio (theoretical or sample estimate), $\psi_{T_j | F}(X)$ its influence function, and $F(x)$ the CDF or ECDF.

  • parametric
  • nonparametric

`scipy.stats.wald.l_moment` bug

>>> scipy.stats.wald.l_moment(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-bb1131db051c> in ?()
----> 1 scipy.stats.wald.l_moment(1)

~/.local/lib/python3.12/site-packages/lmo/contrib/scipy_stats.py in ?(self, r, trim, quad_opts, *args, **kwds)
    317                 # undefined mean -> distr is "pathological" (e.g. cauchy)
    318                 return np.full(_r.shape, np.nan)[()]
    319 
    320         # L-moments of the standard distribution (loc=0, scale=scale0)
--> 321         l0_r = self._l_moment(_r, *args, trim=_trim, quad_opts=quad_opts)
    322 
    323         # shift (by loc) and scale
    324         shift_r = loc * (_r == 1)

~/.local/lib/python3.12/site-packages/lmo/contrib/scipy_stats.py in ?(self, r, trim, quad_opts, *args)
    169             of specific distributions, `r` and `trim`.
    170 
    171         """
    172         cdf, ppf = self._get_xxf(*args)
--> 173         lmbda_r = l_moment_from_cdf(
    174             cdf,
    175             r,
    176             trim=trim,

~/.local/lib/python3.12/site-packages/lmo/theoretical.py in ?(cdf, r, trim, support, quad_opts, alpha, ppf)
    333             * eval_sh_jacobi(_r - 2, t + 1, s + 1, p)
    334         )
    335 
    336     a, d = support or _tighten_cdf_support(cdf, support)
--> 337     b, c = (ppf(alpha), ppf(1 - alpha)) if ppf else (a, d)
    338 
    339     loc0 = a if np.isfinite(a) and a > 0 else 0
    340 

~/.local/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py in ?(self, x)
  10207     def _ppf(self, x):
> 10208         return invgauss._ppf(x, 1.0)

~/.local/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py in ?(self, x, mu)
   4537         with np.errstate(divide='ignore', over='ignore', invalid='ignore'):
   4538             x, mu = np.broadcast_arrays(x, mu)
   4539             ppf = _boost._invgauss_ppf(x, mu, 1)
   4540             i_wt = x > 0.5  # "wrong tail" - sometimes too inaccurate
-> 4541             ppf[i_wt] = _boost._invgauss_isf(1-x[i_wt], mu[i_wt], 1)
   4542             i_nan = np.isnan(ppf)
   4543             ppf[i_nan] = super()._ppf(x[i_nan], mu[i_nan])
   4544         return ppf

TypeError: 'numpy.float64' object does not support item assignment

Apparantly, scipy.stats.wald._ppf requires the input to be a >0d array-like (a scalar won't do). This isn't the case for most (or perhaps all) other continuous distributions.

integer overflow in `linalg.sh_legendre` on windows

See: https://github.com/jorenham/Lmo/actions/runs/6687308385/job/18167801378

The np.int_ is a C long, which is 32bits in win64, because, well.. err... logic...?

... anyway, this is could be solved by replacing all np.int_'s by np.int64.

Another option could be to switch _sh_jacobi_f in the cases where _sh_jacobi_i will overflow given k and dtype in
sh_legendre, and similarly in sh_jacobi(k, a, b, dtype).

Note that in practise, this is only a problem in _lm._l_weights_pwm. The _lm._l_weights_ostat method is slower, slightly less precise, but a lot less likely to overflow.

For any users facing this issue; the workaround is to use e.g. trim=1e-15 instead of trim=0. This forces it to use the ostat weights, since the pwm weights can only handle integer trimming.

Variance structure of sample (T)L moments

$\mathbb{E}[l_{k_1}^{(s, t)}, l_{k_2}^{(s, t)}]$

  • For $s=t$ (L-moments and symmetric TL-moments), Elamir and Seheult (2003) found the exact solution.
  • For $s \neq t$, more research is required.

incorrect `diagnostic.l_ratio_bounds` for `has_variance=True`

The strict bounds should be divided by the L-scale, not by the upper bound of the L-scale. This results is bounds that are too tight.

E.g. the strict untrimmed L-moment bounds for $r \ge 3$ are $\sqrt{2r - 1} |\lambda_r| \le \sigma$, so the corresponding L-ratio bouds are $\sqrt{2r - 1} |\tau_r| \le \sigma / \lambda_2$.

Incorrect trimming of `inf` samples in TL-moment estimators

>>> a = np.array([0, 1, 2, 3, 4, 5, 6, np.inf])

>>> lmo.l_stats(a, trim=0)
array([inf, inf, nan, nan])

>>> lmo.l_stats(a, trim=(0, 1))
array([nan, nan, nan, nan])

>>> lmo.l_stats(a, trim=1)
array([nan, nan, nan, nan])

Only the trim=0 is correct, but in the other cases, it should be:

>>> lmo.l_stats(a, trim=(0, 1))
array([2.   , 1.125, 0.   , 0.   ])

>>> lmo.l_stats(a, trim=1)
array([3.5, 0.9, 0. , 0. ])

`contrib.sympy`

Optional support for SymPy, e.g. utility functions for:

  • extend the sympy.stats distributions with symbolic L-moment methods, analogously to scipy.stats
  • finding closed-form solutions of theoretical
    • order stats
    • PWM's
    • L-moments
    • L-comoments
    • L-moment (asymptotic) variance-covariance matrix
    • IF's, BP's, etc.
    • L-MGF's 😏
  • allow using sympy's numerical integration as an alternative to the current scipy.integrate.quad
  • support for using the interpolative summation method in l_poly (as alternative to the current naïve summation)
  • Hahn polynomials (use $_3F_2$; it's absent in scipy), which show up in the trimmed L-moment weights (Hosking, 2015).

Optional numba support

Add contrib.numba, and use a noop @jit if not available on relevant functions.

Additionally, the scipy.integrate.quad integrand can be sped up with cfunc, see https://numba.readthedocs.io/en/stable/user/cfunc.html#example

In places where scipy.special functions are used, some trickery is needed.
For example, this snippet is used to make scipy.special.erfi work within numba jitted functions for np.float64 input:

# lmo/contrib/numba.py
def _overload_scipy_special_erfi():
    _erfi_f8 = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)(
        numba.extending.get_cython_function_address(
            'scipy.special.cython_special',
            '__pyx_fuse_1erfi',
        ),
    )
    
    @numba.extending.overload(scipy.special.erfi)
    def numba_erfi(*args):
        match args:
            case (numba.types.Float(),):
                def _numba_erfi(*args):
                    return _erfi_f8(*args)
                return _numba_erfi
            case _:
                return None

def overload_scipy_special():
    _overload_scipy_special_erfi()
# lmo/pyproject.toml
[tool.poetry.plugins.numba_extensions]
init = "lmo.contrib.numba:overload_scipy_special"

L-functionals and regression models

Hössjer & Karlsson (2023) describe the framework of L-functionals, which (almost) generalize the Legendre-based (untrimmed) L-moments. They additionally re-invent Hyowon An's Gaussian-centered "HL-moments", and dub them "Hermite L-functionals". Similarly, they introduce the (novel) Laguerre L-functionals with Exp(1) as reference distribution, which sound rather promising IMHO.

They further show how to apply these L-functionals for regression in (transformed) linear- and quantile-regression models (!), using a very flexible, yet practical approach.

Their use of conditional L-functionals might be interesting to attempt to "backport" into the familiar (Legendre- & Jacobi-based) L-moments.

For now, more research into these L-functionals is required before coming up with a concrete implementation plan.

L-moment ratio diagram plotting utilities

L-moment ratio diagrams are the swiss jacknife for distribution identification, and is in general a very useful tool for statistical exploratory analysis.

This requires:

  • Calculation of the L-moment ratio curves for specific families of distributions, e.g. as $\tau_r(\tau_k | \mathcal{F})$.
  • Non-parametric L-ratio curve estimation, either using repeated measurements, or some bootstrapping method
  • Plotting both parametric curves (lineplot), and optionally a non-parametric curve (scatterplot), simultaneously, including helpful legend- and axis labels. Preferrably, this should be backend-agnostic, with (for now) one backend implementation in contrib.matplotlib .

Some examples for inspiration:

  • M.C. Peel et al. (2001): image
  • J.R.M. Hosking (2015): image
  • J. Galeano-Brajones et al. (2023): image image

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.