My name is George and I am a software developer. I reside in Dublin, California, and I am currently working at Kicksaw.
- Email: [email protected]
- LinkedIn: linkedin.com/in/gbocharov
Extreme Value Analysis (EVA) in Python
Home Page: https://georgebv.github.io/pyextremes/
License: MIT License
Is your feature request related to a problem? Please describe.
I work in academic space science research. When publishing research, journals are becoming more strict about citing the code that has been used, often requiring a digital object identifier (DOI). See example here:
https://www.agu.org/publish-with-agu/publish/author-resources/data-and-software-for-authors
Scroll to "Data & Software Citation", which begins: "Citations should include: …"
Describe the solution you'd like
It would be really great if you could generate a DOI for new releases of pyextremes. It's quite easy to do, instructions here: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content. An added benefit is it would give you an indication of where/how your great package is being used in academic literature.
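For reference, GitHub also surfaces a "Cite this repository" button when the repository contains a CITATION.cff file; a minimal sketch is below (the author name and field values are illustrative and would need to be confirmed by the maintainer):

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "pyextremes"
type: software
authors:
  - given-names: "George"
    family-names: "Bocharov"
repository-code: "https://github.com/georgebv/pyextremes"
license: MIT
```

Combined with the Zenodo integration linked above, this gives both a human-readable citation and a DOI per release.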
Describe the bug
After updating to scipy v1.11.2 I get an error when executing plot_mean_residual_life(data)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expected to see the plot as provided in the tutorial
Screenshots
127 mean_residual_lives.append(exceedances.mean())
128 if alpha is not None:
129 mrl_confidence.append(
--> 130 scipy.stats.norm.interval(
131 alpha=alpha,
132 loc=exceedances.mean(),
133 scale=exceedances.std(ddof=1) / np.sqrt(len(exceedances)),
134 )
135 )
137 with plt.rc_context(rc=pyextremes_rc):
138 if ax is None:
TypeError: interval() missing 1 required positional argument: 'confidence'
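For context, SciPy deprecated the `alpha` keyword of `interval` in 1.9 and removed it in 1.11, renaming it to `confidence`. A minimal sketch of the failing call with the new keyword (the sample data here is made up):

```python
import numpy as np
import scipy.stats

# Hypothetical exceedance sample; in the traceback above this is the
# `exceedances` array inside plot_mean_residual_life.
exceedances = np.array([1.2, 2.5, 3.1, 0.8, 1.9, 2.2])

# SciPy 1.11 removed the deprecated `alpha` keyword of
# scipy.stats.norm.interval; its replacement is `confidence`.
low, high = scipy.stats.norm.interval(
    confidence=0.95,
    loc=exceedances.mean(),
    scale=exceedances.std(ddof=1) / np.sqrt(len(exceedances)),
)
```

Passing `alpha=` to this function on SciPy >= 1.11 raises exactly the `TypeError` shown in the traceback.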
Desktop (please complete the following information):
Additional context
Thanks!
Describe the bug
I read a book about extreme value statistics and found this package. I then wanted to validate whether I would obtain the same results.
The results ultimately did not match...
import pandas as pd
from pyextremes import EVA
KobeM = pd.Series([129.7, 51.4, 148.8, 76.3, 68.3, 91.8, 116.5, 97.4, 81.3, 65.8, 50.9, 100.2,
123.2, 199.4, 137.7, 104.7, 72.6, 63.2, 116.5, 71.9, 121.6, 87.8, 89.2, 76.2,
103.7, 112.0, 60.7, 63.2, 119.7, 75.5, 68.3, 67.6, 62.3, 51.1, 62.5, 99.4,
81.4, 86.3, 143.8, 44.5, 58.4, 270.4, 103.5, 94.0, 68.9, 80.0, 112.5, 73.4,
262.8, 58.1, 43.3, 59.0, 181.8, 117.1, 109.4, 115.9, 116.2, 175.8, 73.4, 83.7,
165.6, 56.8, 123.4, 142.2, 195.2, 130.1, 64.5, 97.1, 219.4, 95.7, 319.4, 72.5,
115.5, 53.0, 81.0, 152.0, 68.5, 83.5, 77.0, 77.5, 67.5, 91.0, 86.5, 77.0,
57.0, 125.5, 197.0, 105.5, 75.5, 128.0, 70.0, 150.0, 121.5, 72.0, 57.0, 81.0,
86.0, 40.5, 118.5, 76.5, 82.5, 122.0, 179.5, 123.0, 71.5, 38.5, 108.5, 138.0,
57.0, 76.5, 83.0, 47.0, 69.0, 105.5, 124.5, 64.0, 100.5])
KobeM.index = [pd.Timestamp(i,1,1) for i in range(1897, 2014)]
model = EVA(KobeM)
model.get_extremes(method="BM", block_size='365.2425D')
model.fit_model()
Univariate Extreme Value Analysis
========================================================================================
Source Data
----------------------------------------------------------------------------------------
Data label: None Size: 117
Start: January 1897 End: January 2013
========================================================================================
Extreme Values
----------------------------------------------------------------------------------------
Count: 116 Extraction method: BM
Type: high Block size: 365 days 05:49:12
========================================================================================
Model
----------------------------------------------------------------------------------------
Model: MLE Distribution: genextreme
Log-likelihood: -583.328 AIC: 1172.869
----------------------------------------------------------------------------------------
Free parameters: c=-0.192 Fixed parameters: All parameters are free
loc=78.353
scale=28.215
========================================================================================
The equivalent code using R's ismev package is:
library(ismev)
KobeM <- c(129.7, 51.4, 148.8, 76.3, 68.3, 91.8, 116.5, 97.4, 81.3, 65.8, 50.9, 100.2,
123.2, 199.4, 137.7, 104.7, 72.6, 63.2, 116.5, 71.9, 121.6, 87.8, 89.2, 76.2,
103.7, 112.0, 60.7, 63.2, 119.7, 75.5, 68.3, 67.6, 62.3, 51.1, 62.5, 99.4,
81.4, 86.3, 143.8, 44.5, 58.4, 270.4, 103.5, 94.0, 68.9, 80.0, 112.5, 73.4,
262.8, 58.1, 43.3, 59.0, 181.8, 117.1, 109.4, 115.9, 116.2, 175.8, 73.4, 83.7,
165.6, 56.8, 123.4, 142.2, 195.2, 130.1, 64.5, 97.1, 219.4, 95.7, 319.4, 72.5,
115.5, 53.0, 81.0, 152.0, 68.5, 83.5, 77.0, 77.5, 67.5, 91.0, 86.5, 77.0,
57.0, 125.5, 197.0, 105.5, 75.5, 128.0, 70.0, 150.0, 121.5, 72.0, 57.0, 81.0,
86.0, 40.5, 118.5, 76.5, 82.5, 122.0, 179.5, 123.0, 71.5, 38.5, 108.5, 138.0,
57.0, 76.5, 83.0, 47.0, 69.0, 105.5, 124.5, 64.0, 100.5)
k000 <- gev.fit(KobeM)
The results are:
$conv
[1] 0
$nllh
[1] 588.2654
$mle
[1] 77.8499914 28.1150960 0.1970488
$se
[1] 2.96183720 2.35426533 0.07809392
Expected behavior
I think the number of iterations is not enough for this fit.
Also, when estimating parameters, which package is used for the optimization?
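The summary above shows "Distribution: genextreme", i.e. pyextremes fits via scipy.stats, and scipy's genextreme uses a shape parameter `c` that is the negative of the xi convention used by ismev (and MATLAB's gevfit). With that sign flip, pyextremes' c=-0.192 is close to ismev's xi=0.197. A small simulation sketch illustrating the convention:

```python
import numpy as np
from scipy import stats

# scipy.stats.genextreme parameterizes the GEV with shape `c`, which has
# the OPPOSITE sign to the xi convention of ismev/extRemes (c = -xi).
rng = np.random.default_rng(0)
xi = 0.2  # heavy-tailed GEV shape in the ismev convention
sample = stats.genextreme.rvs(c=-xi, loc=78, scale=28, size=5000, random_state=rng)

c_hat, loc_hat, scale_hat = stats.genextreme.fit(sample)
# c_hat comes out near -0.2; ismev would report the same fit as xi near +0.2
```

So the two fits largely agree once the sign convention is accounted for; the remaining small differences are plausibly down to the optimizers.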
Desktop (please complete the following information):
Pyextremes is useful and flexible and it has been helpful in my research. Can it be made citable?
For example, github provides the option for creating persistent identifiers through zenodo: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content
If you are not interested in creating the persistent identifier for the repository, is there another citation you could suggest for your software? In addition to the Coles 2001 extreme value theory text.
Thank you!
I have a continuous column of data and wanted to plot plot_parameter_stability() from pyextremes.
But I am getting an error. When I used the same column for plot_mean_residual_life(), it produced the plot perfectly.
Error:
I have hidden the column name for confidentiality reasons.
Code was run on the following versions:
Please help.
Hi,
Thanks for creating this super helpful Python package; it's really easy to use and the documentation is really helpful. I use this package for space science research.
I'm interested in returning the confidence intervals or some other measure of error on the fit parameters 'c', 'loc', and 'scale'.
Currently I access the fit parameters via EVA.model.fit_parameters. Having confidence intervals (or any other measure of error) on these parameters would be a great addition to the package. Apologies if this feature has already been implemented; if it has, could you please guide me on how to extract the confidence intervals / errors?
Thanks,
Alexandra Fogg
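As far as the issue text indicates, the fitted parameters are exposed without uncertainty estimates. One generic workaround is a bootstrap over the extracted extremes; the sketch below uses a synthetic sample (in practice the values would come from `model.extremes`):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the extracted extremes (real usage would take
# model.extremes.values from a fitted EVA object).
rng = np.random.default_rng(1)
extremes = stats.genextreme.rvs(c=-0.1, loc=80, scale=25, size=100, random_state=rng)

# Refit the distribution on resampled extremes to get an empirical
# distribution of the (c, loc, scale) estimates.
boot = np.array([
    stats.genextreme.fit(rng.choice(extremes, size=len(extremes), replace=True))
    for _ in range(200)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5], axis=0)  # 95% CIs for (c, loc, scale)
```

This is a sketch, not pyextremes API; more bootstrap replicates (1000+) would be used in practice.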
Hi
I have been trying to use 700 years' worth of time series data, but pandas apparently supports at most about 584 years for datetimes with nanosecond resolution. I was trying to compute the return value stability plot (using pyextremes v2.3.0 and a POT approach) and got the following error:
Is there any workaround for this? I understand that the function takes a pandas series with the date-time object as its index.
Any help is appreciated.
Kind regards,
Dhirendra
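On the pandas side, the ~584-year limit is specific to the default nanosecond unit (roughly 1677-2262); since pandas 2.0 a coarser unit such as seconds extends the representable range far beyond that. Whether pyextremes accepts a non-nanosecond index is a separate question, but the index itself can be built like this:

```python
import pandas as pd

# pandas' default "ns" datetimes cover roughly 1677-2262 (~584 years).
# With pandas >= 2.0, a coarser resolution such as "s" lifts that limit:
# 700 calendar years of yearly timestamps fit comfortably.
idx = pd.date_range("1400-01-01", "2100-01-01", freq="YS", unit="s")
```

If the library still coerces the index to nanoseconds internally, another common workaround is to shift the whole series into the representable window, since extreme value extraction only depends on relative spacing.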
Any plan to support covariates for extreme value analysis similar to this R package:
https://cran.r-project.org/web/packages/extRemes/index.html
When doing
model.get_extremes(method="POT", threshold=20, ..., extremes_type="high")
if the threshold is set so high that no extremes can be found, pyextremes produces a stack trace that is hard to understand (ValueError: attempt to get argmax of an empty sequence). It would be nice to have a proper error message saying that no extremes could be found.
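The friendlier failure mode can be sketched as a pre-check: verify that at least one value exceeds the threshold before attempting POT extraction (the series and threshold below are made up):

```python
import pandas as pd

# Toy series with no values above the chosen threshold.
series = pd.Series(
    [1.0, 5.0, 3.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)
threshold = 20

# Check for exceedances up front instead of letting argmax fail on an
# empty sequence deep inside the extraction code.
exceedances = series[series > threshold]
message = (
    f"no extreme values found above threshold={threshold}"
    if exceedances.empty
    else f"found {len(exceedances)} extreme values"
)
```

A guard like this would let the library raise a self-explanatory ValueError instead of the opaque argmax one.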
Unless I'm mistaken, there is no published API reference. This makes exploring the package difficult. For example, it was not obvious to me where to find the fitted distribution's parameters; I had to look at your code to find out :-) (fortunately, it's very well written, so it wasn't hard :-))
(Firstly I'll just say that this is a fantastic library and very easy to use, great job!)
For whatever reason, AWS Lambda functions cannot use multiprocessing.Pool(). This is a problem because pyextremes uses it when fitting an MLE model, so when I attempt to run my AWS Lambda function, it crashes. My current workaround is to modify line 199 in model_mle.py to if True, thereby ensuring it does not take the multiprocessing path.
This is not ideal. It's slightly tedious, but AWS describes a method for using multiprocessing that is compatible with Lambda functions (instead of Pool, you use Process and Pipe with some additional setup code). Alternatively, you could add a kwarg to bypass multiprocessing entirely, which would be fine for those of us not concerned with speed, and much easier to implement.
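The Process-and-Pipe pattern mentioned above can be sketched as follows; the worker here just squares a number as a stand-in for the per-task fitting work pyextremes would hand to a Pool:

```python
from multiprocessing import Pipe, Process


def _square_worker(conn, x):
    # Stand-in for real per-task work; sends the result back over the pipe.
    conn.send(x * x)
    conn.close()


def run_without_pool(x):
    # AWS Lambda cannot create the /dev/shm-backed semaphores that
    # multiprocessing.Pool needs; Process + Pipe (the pattern AWS
    # documents) sidesteps them.
    parent_conn, child_conn = Pipe()
    p = Process(target=_square_worker, args=(child_conn, x))
    p.start()
    result = parent_conn.recv()
    p.join()
    return result


if __name__ == "__main__":
    print(run_without_pool(7))
```

A `use_multiprocessing=False` style kwarg, as suggested, would be the simpler fix for users who do not need the speedup.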
Hi
I'm following your Quick Start tutorial with my own data, and I notice that for a few time series the methods model.get_summary and model.plot_diagnostic hang, taking over 30 minutes before I kill the cell. I've checked that the time series is uniformly spaced in time and contains only valid values between 0 and 40. I'm not sure what's going on. Are there diagnostics I could run, or do you have intuition for why this might be happening? Cheers
Consider adding a diagnostic plot that displays residuals, defined as the ECDF (from statsmodels, for example) minus the fitted CDF. This would add value and help assess goodness of fit.
First: Thanks for this package, I like it a lot (and it is much faster than my own implementation)!
My problem is block_size:
If I am interested in annual maxima, it easily happens that the selected blocks traverse "hard boundaries" such as 31.12/01.01 (or 31.08/01.09 if I am interested in school years, for instance). It can thus happen (rarely, of course) that an annual maximum is attributed to the wrong year. This is related to the problem of leap years.
A worst-case scenario would probably be:
A maximum some time in, e.g., 2020-12 and another "almost-maximum" at 2021-01-01 03:00, with no comparably high value for the rest of 2021. It could then be that 2021-01-01 03:00 is counted toward the 2020 annual block (and thus does not appear in the extremes). However, I would prefer the value at 2021-01-01 03:00 to be counted toward 2021 and thereby provide another extreme value for 2021.
The date_time_intervals are constructed from the first element of my time series (ts) and the block_size. Since I am aware of this, I can pad my ts with 0's back to my desired start of the year (not with NaNs, because pyextremes removes those before building the date_time_intervals, and I would get even stranger year periods).
My desired solution:
For my purpose, it would be nice to pass date_time_intervals to pyextremes (get_extremes_block_maxima) directly. This would allow me to have hard boundaries at years.
I could imagine this "problem" is even more severe if one looks at monthly blocks: an average block_size would constantly traverse hard month-boundaries.
Anyway, thanks again, and I would be interested to know whether my block_size problem is worth considering.
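Outside pyextremes, one workaround for hard calendar-year boundaries is to take the block maxima with a plain groupby before any further analysis; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic daily series spanning four calendar years.
idx = pd.date_range("2019-01-01", "2022-12-31", freq="D")
rng = np.random.default_rng(42)
ts = pd.Series(rng.gumbel(size=len(idx)), index=idx)

# One maximum per calendar year; a value at 2021-01-01 can never be
# attributed to the 2020 block, unlike block_size-derived intervals.
annual_max = ts.groupby(ts.index.year).max()
```

Accepting explicit date_time_intervals in get_extremes, as requested, would make this kind of hard-boundary blocking possible inside the library itself.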
First: thank you for creating and sharing this software.
Describe the bug
When performing a KS test on extremes selected with POT and a generalized Pareto distribution, the location parameter is not passed to scipy.stats.kstest. As a result, test_statistic is incorrect and the outcome of the KS test may be wrong.
To Reproduce
$ wget https://github.com/georgebv/pyextremes-notebooks/raw/master/data/battery_wl.csv
import pyextremes as px
import pandas as pd
data = (
    pd.read_csv("battery_wl.csv", index_col=0, parse_dates=True)
    .squeeze("columns")
    .sort_index(ascending=True)
    .astype(float)
    .dropna()
)
data = data.loc[pd.to_datetime("1925"):]
data = data - (data.index.array - pd.to_datetime("1992")) / pd.to_timedelta("365.2425D") * 2.87e-3
model = px.EVA(data=data)
model.get_extremes(
    method="POT",
    extremes_type="high",
    threshold=1.4,
    r="12H",
)
model.fit_model()
model.test_ks()
output:
Kolmogorov-Smirnov Test
------------------------------------------------------------------
Null hypothesis: data follows genpareto distribution
Alternative hypothesis: data doesn't follow genpareto distribution
Test statistic: D = 0.9906864591234011
p-value: 3.829434059334826e-179
Significance level: 0.05
Critical value: 0.14274059053041283
Critical region: reject null-hypothesis if D > 0.14274059053041283
Test result: data doesn't follow genpareto distribution
------------------------------------------------------------------
Note: Test statistic returns D = 0.9906864591234011
Expected behavior
Compute the test statistic directly with scipy.stats.kstest:
from scipy import stats
import functools

# old test statistic
gpcdf = functools.partial(stats.genpareto.cdf, **model.model.fit_parameters)  # fixed parameters not included
print(stats.kstest(model.extremes, gpcdf).statistic)
# 0.9906864591234011

# correct test statistic
gpcdf2 = functools.partial(
    stats.genpareto.cdf,
    **{"loc" if k == "floc" else k: v for k, v in model.distribution.fixed_parameters.items()},  # fixed parameters included
    **model.model.fit_parameters,
)
print(stats.kstest(model.extremes, gpcdf2).statistic)
# 0.09295193640733063
System:
Python 3.10.4
Operating System: CentOS Linux 7 (Core)
Architecture: x86-64
Hello!
Thanks so much for this package!
I am just wondering whether the inclusion of covariates in the parameters will be supported in future versions.
Hello,
I am trying to use your package. I pass it a pandas Series of events; those events are storms. I provide a time series of storm durations (in hours) indexed by storm start time. There can be several storms at the same time, and there are times without storms. Then I run:
model = EVA(pd.Series(durations, np.sort(dates))) # Dates are not important, durations are.
model.get_extremes(method="BM", block_size="10D")
model.plot_extremes()
which ends up with:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (731,) + inhomogeneous part.
I believe you expect data at regular intervals (and so I must fill in some gaps), correct? I'll continue fiddling a bit. This issue is more about giving an error message a novice can understand.
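One likely trigger (a guess from the error, not confirmed by the library) is duplicate timestamps in the index, since the report mentions several storms at the same time. Collapsing them to a unique, sorted DatetimeIndex before calling EVA can be sketched as:

```python
import pandas as pd

# Toy storm-duration series with a duplicated start time.
durations = pd.Series(
    [5.0, 12.0, 3.5],
    index=pd.to_datetime(["2020-01-03", "2020-01-03", "2020-02-10"]),
)

# Collapse duplicate timestamps (here: keep the longest storm per
# timestamp) and sort, yielding a unique DatetimeIndex.
unique_durations = durations.groupby(level=0).max().sort_index()
```

An upfront index check in EVA with a message naming the offending timestamps would make this failure mode much easier to diagnose.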
I was reading through your excellent online documentation and one thing that was missing was a basic description of how your confidence intervals are calculated. I'm assuming this placeholder is where that information will go:
https://github.com/georgebv/pyextremes/blob/master/docs/user-guide/12-confidence-intervals.md
I was wondering if there are plans to complete that documentation page (and the user guide more generally), or is the documentation no longer being actively developed?
The code is as follows:
fig, ax = plot_extremes(
ts=series,
extremes=extremes,
extremes_method="BM",
extremes_type="high",
block_size="365.2425D",
figsize=(10, 4),
ax={"lines.color":"green","font.size": 12,"axes.linewidth": 0.8 }
)
The error is:
TypeError: invalid type in <class 'dict'> for the 'ax' argument, must be matplotlib Axes object
What is the correct form for the "ax" argument, and how do I change it?
Thanks
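As the error says, `ax` must be a matplotlib Axes object; rc-style overrides like the dict above belong in a matplotlib rc context instead. A sketch (the pyextremes call itself is left as a comment):

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt

# Apply the styling via an rc context, and pass a real Axes as `ax`.
with plt.rc_context({"lines.color": "green", "font.size": 12, "axes.linewidth": 0.8}):
    fig, ax = plt.subplots(figsize=(10, 4))
    # plot_extremes(ts=series, extremes=extremes, ..., ax=ax)
```

Note that styles set through plot_extremes's own rc handling may override the context; passing a pre-built Axes at least satisfies the type check.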
I used MATLAB's gevfit command to get the location and scale parameters. These are the same as those fitted by pyextremes (MCMC and MLE), but the shape parameter k obtained from gevfit is negative, while the shape parameter obtained from pyextremes (MCMC and MLE) is positive and, perhaps by lucky coincidence, equal to the absolute value of k. Does pyextremes not display negative parameter values, or do I need to set something else?
Thanks! Hope for your answer!
I did a block maxima analysis with a block period of one year and fitted a Gumbel distribution (this is snow data, FYI). I used a strong confidence interval of 0.999.
block_size="365.2425D"
confidence_interval = 0.999
model = EVA(series)
model.get_extremes(method="BM", block_size=block_size, errors='ignore')
model.fit_model(model='MLE', distribution='gumbel_r')
model.plot_diagnostic(alpha=confidence_interval)
I was expecting the confidence interval to look more like the one in the quickstart (picture below): https://georgebv.github.io/pyextremes/quickstart/
Do you know why the confidence interval is "linear"?
I have scipy 1.10.1 and pyextremes 2.2.7.
The estimated parameters can be seen after fitting the model, but is there a way to see parameter estimates with upper and lower bounds, as in the get_return_value method?
Thanks in advance for your effort.
Describe the bug
The user guide's threshold selection example fails, although I previously ran it successfully.
I created a new environment for it, but it still fails:
conda create --name EVA_env python=3.10
conda activate EVA_env
pip install pyextremes[full]
The specific part that fails:
axes = plot_threshold_stability(data, return_period=100, thresholds=np.linspace(0.20, 0.26, 20), progress=True)
ERROR: Multi-dimensional indexing (e.g. obj[:, None]) is no longer supported. Convert to a numpy array before indexing instead.
I tried to run the quick start example on your website and got an error when trying to plot the diagnostics.
You can view the error in the following notebook:
https://github.com/climate-resilient-enterprise/frequency-analysis/blob/master/examples/pyextremes_example.ipynb
It was run on Rocky Linux 8.6, with pyextremes version 2.2.4 and python version 3.10.5.