My name is George and I am a software developer. I reside in Dublin, California, and I am currently working at Kicksaw.
- Email: [email protected]
- LinkedIn: linkedin.com/in/gbocharov
Extreme Value Analysis (EVA) in Python
Home Page: https://georgebv.github.io/pyextremes/
License: MIT License
Is your feature request related to a problem? Please describe.
I work in academic space science research. When publishing research, journals are becoming more strict about citing the code that has been used, often requiring a digital object identifier (DOI). See example here:
https://www.agu.org/publish-with-agu/publish/author-resources/data-and-software-for-authors
Scroll to "Data & Software Citation", which begins: "Citations should include: …"
Describe the solution you'd like
It would be really great if you could generate a DOI for new releases of pyextremes. It's quite easy to do, instructions here: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content. An added benefit is it would give you an indication of where/how your great package is being used in academic literature.
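For reference, GitHub also surfaces a "Cite this repository" button when the repository contains a CITATION.cff file; a minimal sketch is below (the author name and field values are illustrative and would need to be confirmed by the maintainer):

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "pyextremes"
type: software
authors:
  - given-names: "George"
    family-names: "Bocharov"
repository-code: "https://github.com/georgebv/pyextremes"
license: MIT
```

Combined with the Zenodo integration linked above, this gives both a human-readable citation and a DOI per release.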
Describe the bug
After updating to scipy v1.11.2 I get an error when executing plot_mean_residual_life(data)
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expected to see the plot as provided in the tutorial
Screenshots
127 mean_residual_lives.append(exceedances.mean())
128 if alpha is not None:
129 mrl_confidence.append(
--> 130 scipy.stats.norm.interval(
131 alpha=alpha,
132 loc=exceedances.mean(),
133 scale=exceedances.std(ddof=1) / np.sqrt(len(exceedances)),
134 )
135 )
137 with plt.rc_context(rc=pyextremes_rc):
138 if ax is None:
TypeError: interval() missing 1 required positional argument: 'confidence'
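For context, SciPy deprecated the `alpha` keyword of `interval` in 1.9 and removed it in 1.11, renaming it to `confidence`. A minimal sketch of the failing call with the new keyword (the sample data here is made up):

```python
import numpy as np
import scipy.stats

# Hypothetical exceedance sample; in the traceback above this is the
# `exceedances` array inside plot_mean_residual_life.
exceedances = np.array([1.2, 2.5, 3.1, 0.8, 1.9, 2.2])

# SciPy 1.11 removed the deprecated `alpha` keyword of
# scipy.stats.norm.interval; its replacement is `confidence`.
low, high = scipy.stats.norm.interval(
    confidence=0.95,
    loc=exceedances.mean(),
    scale=exceedances.std(ddof=1) / np.sqrt(len(exceedances)),
)
```

Passing `alpha=` to this function on SciPy >= 1.11 raises exactly the `TypeError` shown in the traceback.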
Desktop (please complete the following information):
Additional context
Thanks!
Describe the bug
I read a book about extreme value statistics and found this package. I then wanted to validate whether I would obtain the same results.
The results ultimately did not match...
import pandas as pd
from pyextremes import EVA
KobeM = pd.Series([129.7, 51.4, 148.8, 76.3, 68.3, 91.8, 116.5, 97.4, 81.3, 65.8, 50.9, 100.2,
123.2, 199.4, 137.7, 104.7, 72.6, 63.2, 116.5, 71.9, 121.6, 87.8, 89.2, 76.2,
103.7, 112.0, 60.7, 63.2, 119.7, 75.5, 68.3, 67.6, 62.3, 51.1, 62.5, 99.4,
81.4, 86.3, 143.8, 44.5, 58.4, 270.4, 103.5, 94.0, 68.9, 80.0, 112.5, 73.4,
262.8, 58.1, 43.3, 59.0, 181.8, 117.1, 109.4, 115.9, 116.2, 175.8, 73.4, 83.7,
165.6, 56.8, 123.4, 142.2, 195.2, 130.1, 64.5, 97.1, 219.4, 95.7, 319.4, 72.5,
115.5, 53.0, 81.0, 152.0, 68.5, 83.5, 77.0, 77.5, 67.5, 91.0, 86.5, 77.0,
57.0, 125.5, 197.0, 105.5, 75.5, 128.0, 70.0, 150.0, 121.5, 72.0, 57.0, 81.0,
86.0, 40.5, 118.5, 76.5, 82.5, 122.0, 179.5, 123.0, 71.5, 38.5, 108.5, 138.0,
57.0, 76.5, 83.0, 47.0, 69.0, 105.5, 124.5, 64.0, 100.5])
KobeM.index = [pd.Timestamp(i,1,1) for i in range(1897, 2014)]
model = EVA(KobeM)
model.get_extremes(method="BM", block_size='365.2425D')
model.fit_model()
Univariate Extreme Value Analysis
========================================================================================
Source Data
----------------------------------------------------------------------------------------
Data label: None Size: 117
Start: January 1897 End: January 2013
========================================================================================
Extreme Values
----------------------------------------------------------------------------------------
Count: 116 Extraction method: BM
Type: high Block size: 365 days 05:49:12
========================================================================================
Model
----------------------------------------------------------------------------------------
Model: MLE Distribution: genextreme
Log-likelihood: -583.328 AIC: 1172.869
----------------------------------------------------------------------------------------
Free parameters: c=-0.192 Fixed parameters: All parameters are free
loc=78.353
scale=28.215
========================================================================================
The equivalent code using R's ismev package is:
library(ismev)
KobeM <- c(129.7, 51.4, 148.8, 76.3, 68.3, 91.8, 116.5, 97.4, 81.3, 65.8, 50.9, 100.2,
123.2, 199.4, 137.7, 104.7, 72.6, 63.2, 116.5, 71.9, 121.6, 87.8, 89.2, 76.2,
103.7, 112.0, 60.7, 63.2, 119.7, 75.5, 68.3, 67.6, 62.3, 51.1, 62.5, 99.4,
81.4, 86.3, 143.8, 44.5, 58.4, 270.4, 103.5, 94.0, 68.9, 80.0, 112.5, 73.4,
262.8, 58.1, 43.3, 59.0, 181.8, 117.1, 109.4, 115.9, 116.2, 175.8, 73.4, 83.7,
165.6, 56.8, 123.4, 142.2, 195.2, 130.1, 64.5, 97.1, 219.4, 95.7, 319.4, 72.5,
115.5, 53.0, 81.0, 152.0, 68.5, 83.5, 77.0, 77.5, 67.5, 91.0, 86.5, 77.0,
57.0, 125.5, 197.0, 105.5, 75.5, 128.0, 70.0, 150.0, 121.5, 72.0, 57.0, 81.0,
86.0, 40.5, 118.5, 76.5, 82.5, 122.0, 179.5, 123.0, 71.5, 38.5, 108.5, 138.0,
57.0, 76.5, 83.0, 47.0, 69.0, 105.5, 124.5, 64.0, 100.5)
k000 <- gev.fit(KobeM)
The results are:
$conv
[1] 0
$nllh
[1] 588.2654
$mle
[1] 77.8499914 28.1150960 0.1970488
$se
[1] 2.96183720 2.35426533 0.07809392
Expected behavior
I think the number of iterations is not enough for this fit.
Also, when estimating parameters, which package is used for the optimization?
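The summary above shows "Distribution: genextreme", i.e. pyextremes fits via scipy.stats, and scipy's genextreme uses a shape parameter `c` that is the negative of the xi convention used by ismev (and MATLAB's gevfit). With that sign flip, pyextremes' c=-0.192 is close to ismev's xi=0.197. A small simulation sketch illustrating the convention:

```python
import numpy as np
from scipy import stats

# scipy.stats.genextreme parameterizes the GEV with shape `c`, which has
# the OPPOSITE sign to the xi convention of ismev/extRemes (c = -xi).
rng = np.random.default_rng(0)
xi = 0.2  # heavy-tailed GEV shape in the ismev convention
sample = stats.genextreme.rvs(c=-xi, loc=78, scale=28, size=5000, random_state=rng)

c_hat, loc_hat, scale_hat = stats.genextreme.fit(sample)
# c_hat comes out near -0.2; ismev would report the same fit as xi near +0.2
```

So the two fits largely agree once the sign convention is accounted for; the remaining small differences are plausibly down to the optimizers.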
Desktop (please complete the following information):
Pyextremes is useful and flexible and it has been helpful in my research. Can it be made citable?
For example, github provides the option for creating persistent identifiers through zenodo: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content
If you are not interested in creating the persistent identifier for the repository, is there another citation you could suggest for your software? In addition to the Coles 2001 extreme value theory text.
Thank you!
I have a continuous column of data and wanted to plot plot_parameter_stability() from pyextremes.
But I am getting an error. When I used the same column for plot_mean_residual_life(), it produced the plot perfectly.
Error:
I have hidden the column name for confidentiality reasons.
Code was run on the following versions:
Please help.
Hi,
Thanks for creating this super helpful Python package; it's really easy to use and the documentation is really helpful. I use this package for space science research.
I'm interested in returning the confidence intervals or some other measure of error on the fit parameters 'c', 'loc', and 'scale'.
Currently I access the fit parameters via EVA.model.fit_parameters. Having confidence intervals (or any other measure of error) on these parameters would be a great addition to the package. Apologies if this feature has already been implemented; if it has, could you please guide me on how to extract the confidence intervals / errors?
Thanks,
Alexandra Fogg
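As far as the issue text indicates, the fitted parameters are exposed without uncertainty estimates. One generic workaround is a bootstrap over the extracted extremes; the sketch below uses a synthetic sample (in practice the values would come from `model.extremes`):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the extracted extremes (real usage would take
# model.extremes.values from a fitted EVA object).
rng = np.random.default_rng(1)
extremes = stats.genextreme.rvs(c=-0.1, loc=80, scale=25, size=100, random_state=rng)

# Refit the distribution on resampled extremes to get an empirical
# distribution of the (c, loc, scale) estimates.
boot = np.array([
    stats.genextreme.fit(rng.choice(extremes, size=len(extremes), replace=True))
    for _ in range(200)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5], axis=0)  # 95% CIs for (c, loc, scale)
```

This is a sketch, not pyextremes API; more bootstrap replicates (1000+) would be used in practice.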
Hi
I have been trying to use 700 years' worth of time series data, but pandas apparently supports at most about 584 years for datetimes with nanosecond resolution. I was trying to compute the return value stability plot (using pyextremes v2.3.0 and a POT approach) and got the following error:
Is there any workaround for this? I understand that the function takes a pandas series with the date-time object as its index.
Any help is appreciated.
Kind regards,
Dhirendra
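On the pandas side, the ~584-year limit is specific to the default nanosecond unit (roughly 1677-2262); since pandas 2.0 a coarser unit such as seconds extends the representable range far beyond that. Whether pyextremes accepts a non-nanosecond index is a separate question, but the index itself can be built like this:

```python
import pandas as pd

# pandas' default "ns" datetimes cover roughly 1677-2262 (~584 years).
# With pandas >= 2.0, a coarser resolution such as "s" lifts that limit:
# 700 calendar years of yearly timestamps fit comfortably.
idx = pd.date_range("1400-01-01", "2100-01-01", freq="YS", unit="s")
```

If the library still coerces the index to nanoseconds internally, another common workaround is to shift the whole series into the representable window, since extreme value extraction only depends on relative spacing.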
Any plan to support covariates for extreme value analysis similar to this R package:
https://cran.r-project.org/web/packages/extRemes/index.html
When doing
model.get_extremes(method="POT", threshold=20, ..., extremes_type="high")
if the threshold is set so high that no extremes can be found, pyextremes produces a stack trace that is hard to understand (ValueError: attempt to get argmax of an empty sequence). It would be nice to have a proper error message saying that no extremes could be found.
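The friendlier failure mode can be sketched as a pre-check: verify that at least one value exceeds the threshold before attempting POT extraction (the series and threshold below are made up):

```python
import pandas as pd

# Toy series with no values above the chosen threshold.
series = pd.Series(
    [1.0, 5.0, 3.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)
threshold = 20

# Check for exceedances up front instead of letting argmax fail on an
# empty sequence deep inside the extraction code.
exceedances = series[series > threshold]
message = (
    f"no extreme values found above threshold={threshold}"
    if exceedances.empty
    else f"found {len(exceedances)} extreme values"
)
```

A guard like this would let the library raise a self-explanatory ValueError instead of the opaque argmax one.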
Unless I'm mistaken, there is no published API reference. This makes exploring the package difficult. For example, it was not obvious to me where to find the fitted distribution's parameters; I had to look at your code to find out :-) (fortunately, it's very well written, so it wasn't hard :-))
(Firstly I'll just say that this is a fantastic library and very easy to use, great job!)
For whatever reason, AWS Lambda functions cannot use multiprocessing.Pool(). This is a problem because pyextremes uses it when fitting an MLE model, so when I attempt to run my AWS Lambda function, it crashes. My current workaround is to modify line 199 in model_mle.py to if True, thereby ensuring it does not take the multiprocessing path.
This is not ideal. It's slightly tedious, but AWS describes a method for using multiprocessing that is compatible with Lambda functions (instead of Pool, you use Process and Pipe with some additional setup code). Alternatively, you could add a kwarg to bypass multiprocessing entirely, which would be fine for those of us not concerned with speed, and much easier to implement.
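The Process-and-Pipe pattern mentioned above can be sketched as follows; the worker here just squares a number as a stand-in for the per-task fitting work pyextremes would hand to a Pool:

```python
from multiprocessing import Pipe, Process


def _square_worker(conn, x):
    # Stand-in for real per-task work; sends the result back over the pipe.
    conn.send(x * x)
    conn.close()


def run_without_pool(x):
    # AWS Lambda cannot create the /dev/shm-backed semaphores that
    # multiprocessing.Pool needs; Process + Pipe (the pattern AWS
    # documents) sidesteps them.
    parent_conn, child_conn = Pipe()
    p = Process(target=_square_worker, args=(child_conn, x))
    p.start()
    result = parent_conn.recv()
    p.join()
    return result


if __name__ == "__main__":
    print(run_without_pool(7))
```

A `use_multiprocessing=False` style kwarg, as suggested, would be the simpler fix for users who do not need the speedup.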
Hi
I'm following your Quick Start tutorial with my own data, and I notice that for a few time series the methods model.get_summary and model.plot_diagnostic hang, taking over 30 minutes before I kill the cell. I've checked that the time series is uniformly spaced in time and contains only valid values between 0 and 40. I'm not sure what's going on. Are there diagnostics I could run, or do you have intuition for why this might be happening? Cheers
Consider adding a diagnostic plot that displays residuals, defined as the ECDF (from statsmodels, for example) minus the fitted CDF. This would add value and help assess goodness of fit.
First: Thanks for this package, I like it a lot (and it is much faster than my own implementation)!
My problem is block_size:
If I am interested in annual maxima, it easily happens that the selected blocks traverse "hard boundaries" such as 31.12/01.01 (or 31.08/01.09 if I am interested in school years, for instance). It can thus happen (rarely, of course) that an annual maximum is attributed to the wrong year. This is related to the problem of leap years.
A worst-case scenario would probably be:
A maximum some time in, e.g., 2020-12 and another "almost-maximum" at 2021-01-01 03:00, with no comparably high value for the rest of 2021. It could then be that 2021-01-01 03:00 is counted toward the 2020 annual block (and thus does not appear in the extremes). However, I would prefer the value at 2021-01-01 03:00 to be counted toward 2021 and thereby provide another extreme value for 2021.
The date_time_intervals are constructed from the first element of my time series (ts) and the block_size. Since I am aware of this, I can pad my ts with 0's back to my desired start of the year (not with NaNs, because pyextremes removes those before building the date_time_intervals, and I would get even stranger year periods).
My desired solution:
For my purpose, it would be nice to pass date_time_intervals to pyextremes (get_extremes_block_maxima) directly. This would allow me to have hard boundaries at years.
I could imagine this "problem" is even more severe if one looks at monthly blocks: an average block_size would constantly traverse hard month-boundaries.
Anyway, thanks again, and I would be interested to know whether my block_size problem is worth considering.
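Outside pyextremes, one workaround for hard calendar-year boundaries is to take the block maxima with a plain groupby before any further analysis; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic daily series spanning four calendar years.
idx = pd.date_range("2019-01-01", "2022-12-31", freq="D")
rng = np.random.default_rng(42)
ts = pd.Series(rng.gumbel(size=len(idx)), index=idx)

# One maximum per calendar year; a value at 2021-01-01 can never be
# attributed to the 2020 block, unlike block_size-derived intervals.
annual_max = ts.groupby(ts.index.year).max()
```

Accepting explicit date_time_intervals in get_extremes, as requested, would make this kind of hard-boundary blocking possible inside the library itself.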
First: thank you for creating and sharing this software.
Describe the bug
When performing a KS test on extremes selected with POT and a generalized Pareto distribution, the location parameter is not passed to scipy.stats.kstest. As a result, test_statistic is incorrect and the outcome of the KS test may be wrong.
To Reproduce
$ wget https://github.com/georgebv/pyextremes-notebooks/raw/master/data/battery_wl.csv
import pyextremes as px
import pandas as pd
data = (
    pd.read_csv("battery_wl.csv", index_col=0, parse_dates=True)
    .squeeze("columns")
    .sort_index(ascending=True)
    .astype(float)
    .dropna()
)
data = data.loc[pd.to_datetime("1925"):]
data = data - (data.index.array - pd.to_datetime("1992")) / pd.to_timedelta("365.2425D") * 2.87e-3
model = px.EVA(data=data)
model.get_extremes(
    method="POT",
    extremes_type="high",
    threshold=1.4,
    r="12H",
)
model.fit_model()
model.test_ks()
output:
Kolmogorov-Smirnov Test
------------------------------------------------------------------
Null hypothesis: data follows genpareto distribution
Alternative hypothesis: data doesn't follow genpareto distribution
Test statistic: D = 0.9906864591234011
p-value: 3.829434059334826e-179
Significance level: 0.05
Critical value: 0.14274059053041283
Critical region: reject null-hypothesis if D > 0.14274059053041283
Test result: data doesn't follow genpareto distribution
------------------------------------------------------------------
Note: Test statistic returns D = 0.9906864591234011
Expected behavior
Compute the test statistic directly with scipy.stats.kstest:
from scipy import stats
import functools

# old test statistic
gpcdf = functools.partial(stats.genpareto.cdf, **model.model.fit_parameters)  # fixed parameters not included
print(stats.kstest(model.extremes, gpcdf).statistic)
# 0.9906864591234011

# correct test statistic
gpcdf2 = functools.partial(
    stats.genpareto.cdf,
    **{"loc" if k == "floc" else k: v for k, v in model.distribution.fixed_parameters.items()},  # fixed parameters included
    **model.model.fit_parameters,
)
print(stats.kstest(model.extremes, gpcdf2).statistic)
# 0.09295193640733063
System:
Python 3.10.4
Operating System: CentOS Linux 7 (Core)
Architecture: x86-64
Hello!
Thanks so much for this package!
I am just wondering whether the inclusion of covariates in the parameters will be supported in future versions.
Hello,
I am trying to use your package. I pass it a pandas Series of events; those events are storms. I provide a time series of storm durations (in hours) indexed by storm start time. There can be several storms at the same time, and there are times without storms. Then I run:
model = EVA(pd.Series(durations, np.sort(dates))) # Dates are not important, durations are.
model.get_extremes(method="BM", block_size="10D")
model.plot_extremes()
which ends up with:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (731,) + inhomogeneous part.
I believe you expect data at regular intervals (and so I must fill in some gaps), correct? I'll continue fiddling a bit. This issue is more about giving an error message a novice can understand.
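One likely trigger (a guess from the error, not confirmed by the library) is duplicate timestamps in the index, since the report mentions several storms at the same time. Collapsing them to a unique, sorted DatetimeIndex before calling EVA can be sketched as:

```python
import pandas as pd

# Toy storm-duration series with a duplicated start time.
durations = pd.Series(
    [5.0, 12.0, 3.5],
    index=pd.to_datetime(["2020-01-03", "2020-01-03", "2020-02-10"]),
)

# Collapse duplicate timestamps (here: keep the longest storm per
# timestamp) and sort, yielding a unique DatetimeIndex.
unique_durations = durations.groupby(level=0).max().sort_index()
```

An upfront index check in EVA with a message naming the offending timestamps would make this failure mode much easier to diagnose.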
I was reading through your excellent online documentation and one thing that was missing was a basic description of how your confidence intervals are calculated. I'm assuming this placeholder is where that information will go:
https://github.com/georgebv/pyextremes/blob/master/docs/user-guide/12-confidence-intervals.md
I was wondering if there are plans to complete that documentation page (and the user guide more generally), or is the documentation no longer being actively developed?
The code is as follows:
fig, ax = plot_extremes(
ts=series,
extremes=extremes,
extremes_method="BM",
extremes_type="high",
block_size="365.2425D",
figsize=(10, 4),
ax={"lines.color":"green","font.size": 12,"axes.linewidth": 0.8 }
)
The error is:
TypeError: invalid type in <class 'dict'> for the 'ax' argument, must be matplotlib Axes object
What is the correct form for the "ax" argument, and how do I change it?
Thanks
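As the error says, `ax` must be a matplotlib Axes object; rc-style overrides like the dict above belong in a matplotlib rc context instead. A sketch (the pyextremes call itself is left as a comment):

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt

# Apply the styling via an rc context, and pass a real Axes as `ax`.
with plt.rc_context({"lines.color": "green", "font.size": 12, "axes.linewidth": 0.8}):
    fig, ax = plt.subplots(figsize=(10, 4))
    # plot_extremes(ts=series, extremes=extremes, ..., ax=ax)
```

Note that styles set through plot_extremes's own rc handling may override the context; passing a pre-built Axes at least satisfies the type check.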
I used MATLAB's gevfit command to get the location and scale parameters. These are the same as those fitted by pyextremes (MCMC and MLE), but the shape parameter k obtained from gevfit is negative, while the shape parameter obtained from pyextremes (MCMC and MLE) is positive and, perhaps by lucky coincidence, equal to the absolute value of k. Does pyextremes not display negative parameter values, or do I need to set something else?
Thanks! Hope for your answer!
I did a block maxima analysis with a block period of one year and fitted a Gumbel distribution (this is snow data, FYI). I used a strong confidence interval of 0.999.
block_size="365.2425D"
confidence_interval = 0.999
model = EVA(series)
model.get_extremes(method="BM", block_size=block_size, errors='ignore')
model.fit_model(model='MLE', distribution='gumbel_r')
model.plot_diagnostic(alpha=confidence_interval)
I was expecting the confidence interval to look more like the one in the quickstart (picture below): https://georgebv.github.io/pyextremes/quickstart/
Do you know why the confidence interval is "linear"?
I have scipy 1.10.1 and pyextremes 2.2.7.
The estimated parameters can be seen after fitting the model, but is there a way to see parameter estimates with upper and lower bounds, as in the get_return_value method?
Thanks in advance for your effort.
Describe the bug
The user guide's threshold selection example fails, although I previously ran it successfully.
I created a new environment for it, but it still fails:
conda create --name EVA_env python=3.10
conda activate EVA_env
pip install pyextremes[full]
The specific part that fails:
axes = plot_threshold_stability(data, return_period=100, thresholds=np.linspace(0.20, 0.26, 20), progress=True)
ERROR: Multi-dimensional indexing (e.g. obj[:, None]) is no longer supported. Convert to a numpy array before indexing instead.
I tried to run the quick start example on your website and got an error when trying to plot the diagnostics.
You can view the error in the following notebook:
https://github.com/climate-resilient-enterprise/frequency-analysis/blob/master/examples/pyextremes_example.ipynb
It was run on Rocky Linux 8.6, with pyextremes version 2.2.4 and python version 3.10.5.