
georgebv / pyextremes

233 stars, 6 watchers, 45 forks, 6.29 MB

Extreme Value Analysis (EVA) in Python

Home Page: https://georgebv.github.io/pyextremes/

License: MIT License

Languages: Python 99.76%, Shell 0.24%
Topics: extreme-value-analysis, extreme-value-statistics, statistics, extremes, extreme-events, python, eva, block-maxima, peaks-over-threshold

pyextremes's Introduction

pyextremes's People

Contributors

eastjames, ecomodeller, georgebv, shenxiangzhuang


pyextremes's Issues

Digital Object Identifier for pyextremes

Is your feature request related to a problem? Please describe.
I work in academic space science research. When publishing research, journals are becoming more strict about citing the code that has been used, often requiring a digital object identifier (DOI). See example here:
https://www.agu.org/publish-with-agu/publish/author-resources/data-and-software-for-authors
Scroll to "Data & Software Citation":
"
Citations should include:

  1. Author(s) or project name(s)
  2. Date / Software published
  3. Title / Software name
  4. Data or software release/version (optional)
  5. Bracketed description type (e.g., [Dataset], [Software], [Collection], [ComputationalNotebook])
  6. Repository name / Publication venue
  7. DOI, persistent identifier, URL
  8. Retrieved date (required when using URL)

"

Describe the solution you'd like
It would be really great if you could generate a DOI for new releases of pyextremes. It's quite easy to do; instructions are here: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content. An added benefit is that it would give you an indication of where and how your great package is being used in the academic literature.

Error in plot_mean_residual_life for scipy v1.11.2

Describe the bug
After updating to scipy v1.11.2, I get an error when executing plot_mean_residual_life(data).

To Reproduce
Steps to reproduce the behavior:

  1. Update scipy
  2. Import the example data set (battery water levels)
  3. Try to execute plot_mean_residual_life(data)

Expected behavior
I expected to see the plot as provided in the tutorial

Screenshots

    127     mean_residual_lives.append(exceedances.mean())
    128     if alpha is not None:
    129         mrl_confidence.append(
--> 130             scipy.stats.norm.interval(
    131                 alpha=alpha,
    132                 loc=exceedances.mean(),
    133                 scale=exceedances.std(ddof=1) / np.sqrt(len(exceedances)),
    134             )
    135         )
    137 with plt.rc_context(rc=pyextremes_rc):
    138     if ax is None:

TypeError: interval() missing 1 required positional argument: 'confidence'
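
For context, a minimal sketch of the API change behind this error, assuming scipy >= 1.11, where the deprecated alpha keyword of interval() was removed in favor of confidence:

# Hedged sketch: illustrates only the scipy.stats interval() keyword rename.
import scipy.stats

# pre-1.11 style, now raising TypeError:
#   scipy.stats.norm.interval(alpha=0.95, loc=0.0, scale=1.0)

# current signature uses `confidence`:
print(scipy.stats.norm.interval(confidence=0.95, loc=0.0, scale=1.0))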

Desktop (please complete the following information):

  • macOS
  • Python version 3.9.16

Additional context
Thanks!

BUG: Results do not match ismev for MLE

Describe the bug
I read a book about extreme value statistics and found this package. I then wanted to check whether the same results are obtained.
In the end, the results did not match...

import pandas as pd
from pyextremes import EVA
KobeM = pd.Series([129.7,  51.4, 148.8,  76.3,  68.3,  91.8, 116.5,  97.4,  81.3,  65.8,  50.9, 100.2,
  123.2, 199.4, 137.7, 104.7,  72.6,  63.2, 116.5,  71.9, 121.6,  87.8,  89.2,  76.2,
  103.7, 112.0,  60.7,  63.2, 119.7,  75.5,  68.3,  67.6,  62.3,  51.1,  62.5,  99.4,
   81.4,  86.3, 143.8,  44.5,  58.4, 270.4, 103.5,  94.0,  68.9,  80.0, 112.5,  73.4,
  262.8,  58.1,  43.3,  59.0, 181.8, 117.1, 109.4, 115.9, 116.2, 175.8,  73.4,  83.7,
  165.6,  56.8, 123.4, 142.2, 195.2, 130.1,  64.5,  97.1, 219.4,  95.7, 319.4,  72.5,
  115.5,  53.0,  81.0, 152.0,  68.5,  83.5,  77.0,  77.5,  67.5,  91.0,  86.5,  77.0,
   57.0, 125.5, 197.0, 105.5,  75.5, 128.0,  70.0, 150.0, 121.5,  72.0,  57.0,  81.0,
   86.0,  40.5, 118.5,  76.5,  82.5, 122.0, 179.5, 123.0,  71.5,  38.5, 108.5, 138.0,
   57.0,  76.5,  83.0,  47.0,  69.0, 105.5, 124.5,  64.0, 100.5])

KobeM.index = [pd.Timestamp(i,1,1) for i in range(1897, 2014)]

model = EVA(KobeM)

model.get_extremes(method="BM", block_size='365.2425D')

model.fit_model()

                           Univariate Extreme Value Analysis                            
========================================================================================
                                      Source Data                                       
----------------------------------------------------------------------------------------
Data label:                          None      Size:                                 117
Start:                       January 1897      End:                         January 2013
========================================================================================
                                     Extreme Values                                     
----------------------------------------------------------------------------------------
Count:                                116      Extraction method:                     BM
Type:                                high      Block size:             365 days 05:49:12
========================================================================================
                                         Model                                          
----------------------------------------------------------------------------------------
Model:                                MLE      Distribution:                  genextreme
Log-likelihood:                  -583.328      AIC:                             1172.869
----------------------------------------------------------------------------------------
Free parameters:                 c=-0.192      Fixed parameters: All parameters are free
                               loc=78.353                                               
                             scale=28.215                                               
========================================================================================

The R ismev code is this:

library(ismev)
KobeM <- c(129.7,  51.4, 148.8,  76.3,  68.3,  91.8, 116.5,  97.4,  81.3,  65.8,  50.9, 100.2,
  123.2, 199.4, 137.7, 104.7,  72.6,  63.2, 116.5,  71.9, 121.6,  87.8,  89.2,  76.2,
  103.7, 112.0,  60.7,  63.2, 119.7,  75.5,  68.3,  67.6,  62.3,  51.1,  62.5,  99.4,
   81.4,  86.3, 143.8,  44.5,  58.4, 270.4, 103.5,  94.0,  68.9,  80.0, 112.5,  73.4,
  262.8,  58.1,  43.3,  59.0, 181.8, 117.1, 109.4, 115.9, 116.2, 175.8,  73.4,  83.7,
  165.6,  56.8, 123.4, 142.2, 195.2, 130.1,  64.5,  97.1, 219.4,  95.7, 319.4,  72.5,
  115.5,  53.0,  81.0, 152.0,  68.5,  83.5,  77.0,  77.5,  67.5,  91.0,  86.5,  77.0,
   57.0, 125.5, 197.0, 105.5,  75.5, 128.0,  70.0, 150.0, 121.5,  72.0,  57.0,  81.0,
   86.0,  40.5, 118.5,  76.5,  82.5, 122.0, 179.5, 123.0,  71.5,  38.5, 108.5, 138.0,
   57.0,  76.5,  83.0,  47.0,  69.0, 105.5, 124.5,  64.0, 100.5)
k000 <- gev.fit(KobeM)

The results are:

$conv
[1] 0
$nllh
[1] 588.2654
$mle
[1] 77.8499914 28.1150960  0.1970488
$se
[1] 2.96183720 2.35426533 0.07809392
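
For a quick cross-check, here is a minimal sketch (my addition, not part of the issue) that refits the same data directly with scipy.stats.genextreme, which, as far as I understand, pyextremes wraps for the genextreme distribution:

# Hedged cross-check: refit KobeM (defined above) with scipy directly.
from scipy import stats

c, loc, scale = stats.genextreme.fit(KobeM.values)
print(c, loc, scale)
# Note: scipy's shape parameter c equals -xi in the ismev/Coles convention,
# so a scipy c of about -0.19 corresponds to an ismev shape of about +0.19.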

Expected behavior
I think the number of iterations is not enough for this fit.
Also, I wonder what kind of package is used for parameter optimization when estimating the parameters.

Desktop (please complete the following information):

  • OS : Linux
  • Python version: 3.10.4

Make pyextremes citable with Zenodo?

Pyextremes is useful and flexible and it has been helpful in my research. Can it be made citable?

For example, GitHub provides the option of creating persistent identifiers through Zenodo: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content

If you are not interested in creating a persistent identifier for the repository, is there another citation you could suggest for your software, in addition to the Coles 2001 extreme value theory text?

Thank you!

Error in plot_parameter_stability()

I have a continuous data column and wanted to plot plot_parameter_stability() from pyextremes,
but I am getting an error. When I used the same column for plot_mean_residual_life(), it produced the plot perfectly.

Error:
I have hidden the column name for confidentiality reasons.
[screenshot: error traceback]

Sample values of the data:
[screenshot: sample values of the data]

Code was run on the following versions:

  • Python version : 3.9.16
  • pyextremes - 2.3.1

Please help.

Extracting confidence intervals on fit parameters 'c', 'loc', and 'scale'

Hi,

Thanks for creating this super helpful Python package; it's really easy to use and the documentation is really helpful. I use this package for space science research.

I'm interested in returning the confidence intervals or some other measure of error on the fit parameters 'c', 'loc', and 'scale'.

Currently I access the fit parameters via EVA.model.fit_parameters. Having confidence intervals (or any other measure of error) on these parameters would be a great addition to the package. Apologies if this feature has already been implemented; if it has, could you please guide me on how to extract the confidence intervals / errors? (A workaround sketch is included below.)
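
In the meantime, a possible workaround sketch, not part of the pyextremes API: bootstrap the extracted extremes and refit with scipy to get approximate percentile intervals on c, loc, and scale (the genextreme choice below is an assumption; swap in whichever distribution you fitted):

# Hedged workaround: approximate parameter CIs via a simple nonparametric bootstrap.
import numpy as np
from scipy import stats

extremes = model.extremes.values                  # extremes extracted by the EVA object
rng = np.random.default_rng(42)
boot = np.array([
    stats.genextreme.fit(rng.choice(extremes, size=extremes.size, replace=True))
    for _ in range(500)
])                                                # one (c, loc, scale) row per resample
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
print("95% intervals for (c, loc, scale):", list(zip(lower, upper)))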

Thanks,
Alexandra Fogg

Long timeseries support: thinking beyond pandas datetime range

Hi

I have been trying to use 700 years' worth of time series data, and pandas apparently supports at most about 584 years of datetimes at nanosecond resolution. I was trying to compute the return value stability plot (using pyextremes v2.3.0 and a POT approach) and got the following error:

[screenshot: error traceback]

Is there any workaround for this? I understand that the function takes a pandas Series with a datetime index.

My series looks like this:
[screenshot: sample of the series]

Any help is appreciated.
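
For context, a minimal sketch of the limit being hit; the pandas 2.x part is an assumption on my side, and I have not verified whether pyextremes accepts a coarser-resolution index:

# Hedged sketch: the ~584-year span comes from nanosecond-resolution timestamps.
import numpy as np
import pandas as pd

print(pd.Timestamp.min, pd.Timestamp.max)  # roughly 1677 to 2262 at ns resolution

# pandas >= 2.0 can store coarser resolutions (here seconds), covering 700+ years:
idx = pd.DatetimeIndex(np.array(["1400-01-01", "2100-01-01"], dtype="datetime64[s]"))
print(idx.dtype)  # datetime64[s]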

Kind regards,
Dhirendra

Add API description

Unless I'm mistaken, there's no API reference published. That makes exploring the package difficult. For example, it was not obvious to me where to find the fitted distribution's parameters; I had to look at your code to find out :-) (fortunately, it's very well written, so it wasn't hard :-) ). See the sketch below.
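
For readers landing here, a minimal sketch based on attribute names mentioned in other issues on this page (EVA.model.fit_parameters and EVA.distribution); treat it as an assumption rather than official documentation:

# Hedged sketch: where the fitted parameters appear to live on the EVA object.
from pyextremes import EVA

model = EVA(data=series)                     # `series` is your pandas Series
model.get_extremes(method="BM", block_size="365.2425D")
model.fit_model()
print(model.model.fit_parameters)            # free parameters, e.g. {'c': ..., 'loc': ..., 'scale': ...}
print(model.distribution.fixed_parameters)   # parameters held fixed, if any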

Multiprocessing in MLE model prevents use on AWS Lambda Functions

(Firstly I'll just say that this is a fantastic library and very easy to use, great job!)

For whatever reason, AWS Lambda functions cannot use multiprocessing.Pool(). This is a problem because pyextremes uses it when fitting an MLE model. So when I attempt to run my AWS Lambda function it crashes, and my current solution is to modify line 199 in model_mle.py to if True, thereby ensuring it does not take the multiprocessing path.

This is not ideal. It's slightly tedious, but AWS describes a method for using multiprocessing that is compatible with Lambda functions (instead of using Pool, you use Process and Pipe with some additional setup code; a rough sketch follows). Alternatively, you could just add a kwarg to skip the multiprocessing path, which would be fine for those of us not concerned with speed and much easier to implement.
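
A rough, self-contained sketch of the Process/Pipe pattern referred to above (illustrative only; the squaring worker is a stand-in for one unit of work such as a single bootstrap fit, not pyextremes code):

# Hedged sketch: Pool-free multiprocessing via Process + Pipe, per the AWS article.
from multiprocessing import Process, Pipe

def worker(conn, x):
    conn.send(x * x)      # stand-in for one unit of work
    conn.close()

if __name__ == "__main__":
    parent_conns, processes = [], []
    for x in range(4):
        parent, child = Pipe()
        p = Process(target=worker, args=(child, x))
        p.start()
        parent_conns.append(parent)
        processes.append(p)

    results = [conn.recv() for conn in parent_conns]
    for p in processes:
        p.join()
    print(results)        # [0, 1, 4, 9]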

model.get_summary and model.plot_diagnostics taking a long time

Hi

I'm following your Quick Start tutorial with my own data, and I notice that for a few time series the methods model.get_summary and model.plot_diagnostic hang, taking over 30 minutes before I kill the cell. I've checked that the time series is uniformly spaced in time and contains only valid values between 0 and 40. I'm not sure what's going on; are there diagnostics I could run, or do you have an intuition for why this might happen? Cheers
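
One hedged guess, based on how the confidence intervals appear to be bootstrapped for MLE models: if your pyextremes version's get_summary accepts an n_samples argument (an assumption; check the signature in your version), lowering it should reveal whether the hang is just slow resampling:

# Hedged sketch: fewer bootstrap samples -> faster but rougher confidence intervals.
summary = model.get_summary(
    return_period=[2, 5, 10, 25, 50, 100],
    alpha=0.95,
    n_samples=100,
)
print(summary)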

alternative to block_size

First: Thanks for this package, I like it a lot (and it is much faster than my own implementation)!

My problem is "block_size":
If I am interested in annual maxima, it easily happens that the selected blocks cross "hard boundaries" such as 31.12/01.01 (or 31.08/01.09 if I am interested in school years, for instance). It can thus happen (rarely, of course) that an annual maximum is attributed to the wrong year. This is of course related to the problem of leap years.

A worst-case scenario would probably be:
a maximum some time in, e.g., December 2020 and another "almost-maximum" at 2021-01-01 03:00, with no comparably high value for the rest of 2021. It could then be that the 2021-01-01 03:00 value is counted toward the 2020 block (and thus does not appear in the extremes). However, I would really prefer the value at 2021-01-01 03:00 to be counted toward 2021 and therefore provide another extreme value for 2021.

The date_time_intervals are constructed from the first element in my time series (ts) and the block_size. Since I am aware of this, I can pad my ts with 0's back to my desired start of the year (not with NaNs, because they are removed by pyextremes before the date_time_intervals are built, which would give me even stranger year periods).

My desired solution:
For my purpose, it would be nice to pass date_time_intervals to pyextremes (get_extremes_block_maxima) directly. This would allow me to have hard boundaries at years.

I could imagine this "problem" is even more severe for monthly blocks: an average block_size would constantly cross hard month boundaries. A workaround sketch follows below.
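
As a workaround sketch (not a pyextremes feature, just pandas resampling, under the assumption that pre-computed block maxima are acceptable for your fit):

# Hedged sketch: build blocks with hard calendar boundaries yourself.
import pandas as pd

annual_maxima = ts.resample("YS").max()           # calendar years, blocks start Jan 1
school_year_maxima = ts.resample("YS-SEP").max()  # "years" starting Sep 1
monthly_maxima = ts.resample("MS").max()          # calendar months, blocks start on the 1st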

Anyway, thanks again, and I would be interested to know whether my block_size problem is worth considering.

KS test gives incorrect test_statistic

First: thank you for creating and sharing this software.

Describe the bug
When performing a KS test on extremes selected with POT and a generalized Pareto distribution, the location parameter is not given to scipy.stats.kstest. As a result, test_statistic is incorrect and the result of the KS test may be incorrect.

To Reproduce

$ wget https://github.com/georgebv/pyextremes-notebooks/raw/master/data/battery_wl.csv
import pyextremes as px
import pandas as pd


data = (
    pd
    .read_csv("battery_wl.csv", index_col=0, parse_dates=True)
    .squeeze('columns')
    .sort_index(ascending=True)
    .astype(float)
    .dropna()
)
data = data.loc[pd.to_datetime("1925"):]
data = data - (data.index.array - pd.to_datetime("1992")) / pd.to_timedelta("365.2425D") * 2.87e-3

model = px.EVA(data=data)

model.get_extremes(
    method="POT",
    extremes_type="high",
    threshold = 1.4,
    r = '12H'
)

model.fit_model()

model.test_ks()

output:

                     Kolmogorov-Smirnov Test                      
------------------------------------------------------------------
Null hypothesis: data follows genpareto distribution              
Alternative hypothesis: data doesn't follow genpareto distribution
                                                                  
Test statistic: D = 0.9906864591234011                            
p-value: 3.829434059334826e-179                                   
                                                                  
Significance level: 0.05                                          
Critical value: 0.14274059053041283                               
Critical region: reject null-hypothesis if D > 0.14274059053041283
                                                                  
Test result: data doesn't follow genpareto distribution           
------------------------------------------------------------------

Note: Test statistic returns D = 0.9906864591234011

Expected behavior
Compute the test statistic directly with scipy.stats.kstest:

from scipy import stats
import functools

# old test statistic
gpcdf = functools.partial(stats.genpareto.cdf,**model.model.fit_parameters) # fixed parameters not included
print(stats.kstest(model.extremes,gpcdf).statistic)
# 0.9906864591234011

# correct test statistic
gpcdf2 = functools.partial(
    stats.genpareto.cdf,
    **{'loc' if k=='floc' else k:v for k,v in model.distribution.fixed_parameters.items()}, # fixed parameters included
    **model.model.fit_parameters
)
print(stats.kstest(model.extremes,gpcdf2).statistic)
# 0.09295193640733063

System:
Python 3.10.4

  Operating System: CentOS Linux 7 (Core)
      Architecture: x86-64

Support of covariates

Hello!

Thanks so much for this package!

I am just wondering whether the inclusion of covariates in the parameters will be supported in future versions.

What's the meaning of "ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions"

Hello,

I am trying to use your package. I pass it a pandas Series of events; those events are storms. I provide a time series of storm durations (in hours) indexed by storm start time. There can be several storms at the same time, and there are times without storms. Then I run:

model = EVA(pd.Series(durations, np.sort(dates))) # Dates are not important, durations are.
model.get_extremes(method="BM", block_size="10D")
model.plot_extremes()

which ends up with:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (731,) + inhomogeneous part.

I believe you expect data at regular intervals (and so I must fill in some gaps). Correct? I'll continue fiddling a bit. This issue is mostly about providing an error message that a novice can understand. A small diagnostic sketch follows.
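
A hedged diagnostic sketch, under the assumption (mine, not confirmed) that duplicate or irregular timestamps in the index are what trips the error:

# Hedged sketch: check the index for duplicates and ordering before handing it to EVA.
import numpy as np
import pandas as pd

series = pd.Series(durations, index=np.sort(dates))
print("duplicate timestamps:", series.index.duplicated().sum())
print("monotonic increasing:", series.index.is_monotonic_increasing)

# One option if duplicates exist: keep the longest storm per timestamp.
series = series.groupby(level=0).max()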

How are your confidence intervals calculated?

I was reading through your excellent online documentation and one thing that was missing was a basic description of how your confidence intervals are calculated. I'm assuming this placeholder is where that information will go:
https://github.com/georgebv/pyextremes/blob/master/docs/user-guide/12-confidence-intervals.md

I was wondering if there are plans to complete that documentation page (and the user guide more generally), or is the documentation no longer being actively developed?

change visuals of the plot

The code is listed as follows:

fig, ax = plot_extremes(
    ts=series,
    extremes=extremes,
    extremes_method="BM",
    extremes_type="high",
    block_size="365.2425D",
    figsize=(10, 4),
    ax={"lines.color":"green","font.size": 12,"axes.linewidth": 0.8 }
)

The error is:
TypeError: invalid type in <class 'dict'> for the 'ax' argument, must be matplotlib Axes object

What is the right form for the "ax" argument, and how can I change the plot's visuals? (A hedged sketch follows.)
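
Based on the error message, ax expects a matplotlib Axes object, while style settings belong in a matplotlib rc context; a hedged sketch of that split (the rc keys mirror the ones in the question, and whether pyextremes' internal rc settings override them is something I have not checked):

# Hedged sketch: create the Axes yourself and apply styling via rc_context.
import matplotlib.pyplot as plt

with plt.rc_context({"lines.color": "green", "font.size": 12, "axes.linewidth": 0.8}):
    fig, ax = plt.subplots(figsize=(10, 4))
    plot_extremes(
        ts=series,
        extremes=extremes,
        extremes_method="BM",
        extremes_type="high",
        block_size="365.2425D",
        ax=ax,
    )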

Thanks

On GEV maximum likelihood estimation and MCMC with pyextremes

I used the MATLAB gevfit command to get the location and scale parameters. These are the same as those fitted by pyextremes (MCMC and MLE), but the shape parameter k obtained from gevfit is negative, while the shape parameter obtained from pyextremes (MCMC and MLE) is positive and, by what seems like a lucky coincidence, equal to the absolute value of k. Does pyextremes (MCMC and MLE) not display negative parameter values, or do I need to set something else?
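
For reference, a hedged note: scipy.stats.genextreme (which, as far as I understand, pyextremes builds on) defines its shape parameter as c = -k relative to the MATLAB gevfit / Coles convention, so the sign flip is expected rather than a display setting. A tiny self-contained check:

# Hedged sketch: the two parameterizations give identical CDFs when c = -k.
import numpy as np
from scipy import stats

k, loc, scale = 0.2, 78.0, 28.0                                   # Coles/MATLAB-style shape k
x = np.linspace(60.0, 300.0, 5)
coles_cdf = np.exp(-(1.0 + k * (x - loc) / scale) ** (-1.0 / k))  # GEV CDF, Coles convention
scipy_cdf = stats.genextreme.cdf(x, c=-k, loc=loc, scale=scale)   # note the sign flip
print(np.allclose(coles_cdf, scipy_cdf))                          # True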

Thanks! I hope for your answer!

Confidence interval question

I did a block maxima analysis with a block size of one year and fitted a Gumbel distribution (this is snow data, FYI). I used a strong confidence level of 0.999.

block_size="365.2425D"
confidence_interval = 0.999
model = EVA(series)
model.get_extremes(method="BM", block_size=block_size, errors='ignore')
model.fit_model(model='MLE', distribution='gumbel_r')
model.plot_diagnostic(alpha=confidence_interval)

[screenshot: diagnostic plot showing an apparently linear confidence interval]

I was expecting the confidence interval to look more like the one in the quickstart (pictured below): https://georgebv.github.io/pyextremes/quickstart/
[screenshot: quickstart diagnostic plot]

Do you know why the confidence interval is "linear"?

I have scipy 1.10.1 and pyextremes 2.2.7.

Multi-dimensional indexing (e.g. `obj[:, None]`) is no longer supported. Convert to a numpy array before indexing instead.

Describe the bug
This is the threshold selection example from the user guide, which I had run successfully before.
I created a new environment for it, but it still fails:
conda create --name EVA_env python=3.10
conda activate EVA_env
pip install pyextremes[full]

The specific part:

import numpy as np
from pyextremes import plot_threshold_stability

axes = plot_threshold_stability(data, return_period=100, thresholds=np.linspace(0.20, 0.26, 20), progress=True)

ERROR: Multi-dimensional indexing (e.g. obj[:, None]) is no longer supported. Convert to a numpy array before indexing instead.

However, the mean residual life plot works.
