nci / scores Goto Github PK

View Code? Open in Web Editor NEW

18.0 7.0 9.0 10.03 MB

scores: metrics for the verification, evaluation and optimisation of forecasts, predictions or models

Home Page: https://scores.readthedocs.io/

License: Other

Python 3.43% Jupyter Notebook 96.57% Shell 0.01%

dask forecast-evaluation forecast-verification forecasting model-validation pandas python xarray climate oceanography

scores's Introduction

Scores: Verification and Evaluation for Forecasts and Models

Notice: scores is undergoing final testing and review. When this is completed, this notice will be removed.

A list of over 50 metrics, statistical techniques and data processing tools contained in scores is available here.

scores is a Python package containing mathematical functions for the verification, evaluation, and optimisation of forecasts, predictions or models. It primarily supports the meteorological, climatological and geoscience communities. In addition to supporting the Earth system science communities, it also has wide potential application in machine learning and other domains.

Documentation is hosted at scores.readthedocs.io.
Source code is hosted at github.com/nci/scores.
The tutorial gallery is hosted at as part of the documentation, here.

Overview

Here is a curated selection of the metrics, tools and statistical tests included in scores:

	Description	Selection of Included Functions
Continuous	Scores for evaluating single-valued continuous forecasts.	Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Additive Bias, Multiplicative Bias, Pearson's Correlation Coefficient, Flip-Flop Index, Quantile loss, Murphy score.
Probability	Scores for evaluating forecasts that are expressed as predictive distributions, ensembles, and probabilities of binary events.	Brier Score, Continuous Ranked Probability Score (CRPS) for Cumulative Density Function (CDF), Threshold weighted CRPS for CDF, CRPS for ensembles, Receiver Operating Characteristic (ROC), Isotonic Regression (reliability diagrams).
Categorical	Scores for evaluating forecasts based on categories.	Probability of Detection (POD), False Alarm Ratio (FAR), Probability of False Detection (POFD), Success Ratio, Accuracy, Peirce's Skill Score, Critical Success Index (CSI), Gilbert Skill Score, Heidke Skill Score, Odds Ratio, Odds Ratio Skill Score, F1 score, FIxed Risk Multicategorical (FIRM) Score.
Statistical Tests	Tools to conduct statistical tests and generate confidence intervals.	Diebold Mariano.
Processing Tools	Tools to pre-process data.	Data matching, Discretisation, Cumulative Density Function Manipulation.

scores not only includes common scores (e.g. MAE, RMSE), it includes novel scores not commonly found elsewhere (e.g. FIRM, Flip-Flop Index), complex scores (e.g. threshold weighted CRPS), and statistical tests (such as the Diebold Mariano test). Additionally, it provides pre-processing tools for preparing data for scores in a variety of formats including cumulative distribution functions (CDF). scores provides its own implementations where relevant to avoid extensive dependencies.

scores primarily supports xarray datatypes for Earth system data allowing it to work with NetCDF4, HDF5, Zarr and GRIB data sources among others. scores uses Dask for scaling and performance. Some metrics work with pandas and we will aim to expand this capability.

All of the scores and metrics in this package have undergone a thorough scientific review. Every score has a companion Jupyter Notebook tutorial that demonstrates its use in practice.

Contributing

To find out more about contributing, see our Contributing Guide.

All interactions in discussions, issues, emails and code (e.g. merge requests, code comments) will be managed according to the expectations outlined in the code of conduct and in accordance with all relevant laws and obligations. This project is an inclusive, respectful and open project with high standards for respectful behaviour and language. The code of conduct is the Contributor Covenant, adopted by over 40,000 open source projects. Any concerns will be dealt with fairly and respectfully, with the processes described in the code of conduct.

Using This Package

The installation guide describes four different use cases for installing, using and working with this package.

Most users currently want the all installation option. This includes the mathematical functions (scores, metrics, statistical tests etc.), the tutorial notebooks and development libraries.

From a Local Checkout of the Git Repository

> pip install -e .[all]

Here is a short example of the use of scores:

> import scores
> forecast = scores.sample_data.simple_forecast()
> observed = scores.sample_data.simple_observations()
> mean_absolute_error = scores.continuous.mae(forecast, observed)
> print(mean_absolute_error)
<xarray.DataArray ()>
array(2.)

To install the mathematical functions ONLY (no tutorial notebooks, no developer libraries), use the minimal installation option. minimal is a stable version with limited dependencies and can be installed from the Python Package Index.

> pip install scores

Finding, Downloading and Working With Data

scores can be used with a broad variety of data sources. See the Data Sources page and this tutorial for more information on finding, downloading and working with different sources of data.

Acknowledging This Work

If you find this work useful, please consider citing or acknowledging it. A citable DOI is coming soon. This section will be updated in the coming weeks to include the correct citation.

scores's People

Contributors

Stargazers

Watchers

Forkers

andrewdhicks john-sharples reza-armuei tennlee steph-chong nicholasloveday bethebert

scores's Issues

Make pre-commit linters scan all files.

GitHub actions improvment
I plan to fix this myself, jsut adding this as an issue so it doesn't get forgotten:
pre-commit command in github actions should be changed to pre-commit run -a so as to scan all files. However many mypy and lint issues need to be solved first, and there are currently outstanding PRs to fix some of these issues. I will make this change once those PRs are merged.

Update maintainer notes following the 0.4 process

Updating the procedure for how to do a scores release following some insights from v0.4 release

Add FIRM score

Add the scoring function in Taggart, R., Loveday, N. and Griffiths, D., 2022. A scoring framework for tiered warnings and multicategorical forecasts based on fixed risk measures. Quarterly Journal of the Royal Meteorological Society, 148(744), pp.1389-1406.

Update documentation to mention how GRIB data can be used

GRIB data is compatible with xarray/scores and is still widely used in the Atmospheric Science.

We should mention somewhere (perhaps the ReadMe) how to do it.

e.g.

use engine='cfgrib' when opening grib file with xarray.

Requires cfgrib https://github.com/ecmwf/cfgrib

Align develop branch with latest main

Add `mypy` support for type hints

Tasks

Add mypy to pre-commit config
Make sure mypy runs without issues.

Add static code analysis to GitHub Actions

Is your feature request related to a problem? Please describe.
To ensure consistent style and code quality we need to add some static code checks. Requested feature is to add new github action to run static code analysis, including black, pylint, isort, and mypy.

Describe the solution you'd like
New GitHub actions to run above mentioned tools automatically on push and pull-request.

Happy to be assigned to this task

Add type hints for public functions

Add type hints to all public function functions (nonprivate functions will be done in a separate issue)

BUG - MAE method doesn't like pd.Series when using dimension preservation

Hi folks, not sure if this is a bug but i thought i'd drop it in here anyway...

I am trying to do simple MAE between two pandas Series data using scores.continuous.mae, which the docstrings tells me is kosher, and trying to preserve dimensions. When i do this i get an AttributeError: 'Series' object has no attribute 'dims'

trace:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_2250677/1926751918.py in ?()
----> 1 scores.continuous.mae(loaded_site_forecast.data.siteforecast_air_temperature, loaded_site_forecast.data.observations_temperature_at_screen_level, preserve_dims='all')

~/mambaforge/envs/site/lib/python3.11/site-packages/scores/continuous.py in ?(fcst, obs, reduce_dims, preserve_dims, weights)
    141     ae = abs(error)
    142     ae = scores.functions.apply_weights(ae, weights)
    143 
    144     if preserve_dims is not None or reduce_dims is not None:
--> 145         reduce_dims = scores.utils.gather_dimensions(fcst.dims, obs.dims, reduce_dims, preserve_dims)
    146 
    147     if reduce_dims is not None:
    148         _ae = ae.mean(dim=reduce_dims)

~/mambaforge/envs/site/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name)
   6198             and name not in self._accessors
   6199             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6200         ):
   6201             return self[name]
-> 6202         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'dims'

I expect that this is purely that no-one has ever run a pd.Series through this pathway before and so the .dim hasn't been an issue...

Add ROC curve

Also include concave ROC curve calculation once isotonic regression has been completed.

update handling when preserve_dims=all

When preserve_dims="all", gather_dimensions returns an empty set.

In a scoring function, the dimensions to reduce is an empty set (as we don't want to reduce any dimensions), however dataarray.mean() reduces everything which is the opposite to what we want.

add left or right endpoints option to FIRM

At the moment you can't choose if you have left or right endpoints for the decision thresholds in the FIRM score. This would be good to add.

fix mypy errors in firm score

multicategorical_impl.py:109: error: Argument 5 to "_single_category_score" has incompatible type "float | None"; expected "float" [arg-type]
multicategorical_impl.py:117: error: Item "int" of "Dataset | Literal[0]" has no attribute "mean" [union-attr]
multicategorical_impl.py:210: error: Incompatible types in assignment (expression has type "int", variable has type "ndarray[Any, dtype[Any]]") [assignment]
multicategorical_impl.py:211: error: Incompatible types in assignment (expression has type "int", variable has type "ndarray[Any, dtype[Any]]") [assignment]
Found 4 errors in 1 file (checked 1 source file)

clean up utils_test_data

utils_test_data.py needs cleaning up. It contains some code that is commented out as well as some test objects that I don't think are used e.g., utils_test_data.DA_1

Update metrics in readme

The readme is missing a couple of metrics in the table and should be updated.

Add nbmake or equivalent pytest extension for notebook e2e tests

Add nbmake stage for notebook e2e tests to ensure notebooks don’t break during PRs.

https://github.com/treebeardtech/nbmake

Clean up docstring formatting

Tasks

Make docstring format consistent with the Google style (Napoleon) with 72 character limits per line.
Add section to contributing about using the Napoleon format.

Add "under construction" notice to the README

"scores" is undergoing initial setup and maintenance. Information on this process should be placed in the README.

Handle weightings argument to scores

The weightings keyword argument was added as standard to function signatures but not implemented. This is intended to allow a weightings array to be passed through, representing things like area averaging, population density weighting or another kind of importance or significance weighting. Requirements need to first be developed more clearly and then the functionality implemented.

Add Murphy score

rename diebold mariano tutorial

Add the Modified Diebold-Mariano test statistic

We need to migrate the Modified Diebold-Mariano test statistic across.

Consider renaming stats tests dir

Currently we have scores.stats.tests for statistical tests. This may be confusing since it's not a directory for pytests and it is worth considering other options.

refactor test suite to act like regular modules

Details here:
https://docs.pytest.org/en/7.1.x/explanation/pythonpath.html#test-modules-conftest-py-files-inside-packages

will also cut down the unnecessary directories, relative imports and relative paths in the suite.

Add RMSE

Add root mean squared error to continuous

Address code coverage

Lift code coverage from 99 to 100%
Re-introduce a 100% coverage requirement

Fix mypy issue with murphy typehints

murphy_impl.py:93: error: Incompatible types in assignment (expression has type "str", variable has type "Literal['quantile', 'huber', 'expectile']") [assignment]

Add Root Mean Squared Error

Add an RMSE score, as described here

This will follow the design pattern evident in scores.continuous.mse

Address small test coverage gap

Coverage is currently around 99%. This addresses a few uncovered lines.

replace setuptools with hatchling for pkg build

make diebold mariano work nicely with dask

Currently the Diebold Mariano test statistic brings data into memory due to the autocovariance calculation.

We should see if we can wrap this function so that it that it works nicely with dask using something like https://docs.xarray.dev/en/stable/user-guide/dask.html#automatic-parallelization-with-apply-ufunc-and-map-blocks

Add tests to confirm dask compatability

We need some tests that test that when forecast and observation data is a chunked xarray object, that the metric can be called and the data isn't brought into memory before .compute is called.

These should be applied to each metric.

update docstrings for with weighting info

In various parts of the code, it still says the weighting hasn't been implemented when it actually has https://github.com/nci/scores/blob/develop/src/scores/continuous.py#L32

We need to go through and update these docstrings.

Correction to example in README

The README instructions are slightly incorrect with regards to the sample_data API

Add type hints for all functions

Add type hints for all functions (not just the public functions)

remove reduce_dims arg from gather_dimensions

Improve round_values docstring

The docstring in https://github.com/nci/scores/blob/develop/src/scores/probability/functions.py#L19 needs to be clearer to ensure that the user understands that the rounding_precision arg is different to specifying how many places one wants to round to.

JOSS paper draft

The intention is to prepare a paper for submission to the Journal of Open Source Software (JOSS), see https://joss.readthedocs.io/en/latest/submitting.html.

This issue is to track the development of 'paper.md', the markdown behind the paper which is submitted to JOSS.

Set up readthedocs for this package

The intention is to use readthedocs for documentation hosting, including auto generated API documentation.

Add FlipFlop Index

Clarify and refine dimension gathering rules

Dimensions handling plain language requirements:
• We accept xarray broadcasting rules
• Sets, lists and tuples are all acceptable here
• Strings supplied to preserve_dims or reduces_dims convert to a string in a list
• If 'all' is specified, and there is a dimension in the data called 'all', do what would normally be done but raise a warning and explain how to control the behaviour by putting 'all' inside a list instead
• If reduce_dims and preserve_dims are both non-null, raise an exception
• (default) If reduce_dims and preserve_dims are both null, reduce all dimensions
• If reduce_dims is [], then reduce nothing
• If reduce_dims is 'all', then reduce everything
• If preserve_dims = [], then reduce everything
• If preserve_dims = 'all', then reduce nothing
• If a dimension is present in reduce_dims or preserve_dims but this dimension does not actually appear in the data, throw an exception
• If reduce_dims = [“a”, “b”], reduce dimensions “a” and “b” (after broadcasting if applicable)
If preserve_dims = [“a”, “b”], reduce all dimensions that are not “a” and “b” (after broadcasting if applicable). Need to test the special case where all dimensions within fcst/obs are specified in the list to avoid the bug in point 2 of #18

add ability to handle angular data

This will make things like the Flip-Flop index #54 easier to migrate into scores

Pull request checklist

It would be nice if contributors had a checklist of things to do/check when creating a pull request for a new metric that aligns with the contributor guide https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/creating-a-pull-request-template-for-your-repository

Ideally it would be nice to have multiple templates to cover pull requests that aren't adding a new score.

Add isotonic regression

Implement isotonic regression.
This is relevant to both continuous and probabilistic forecasts of binary outcomes.

Explore weighted scores for Pandas data

The initial implementation of weightings works for xarray data but doesn't consider other data types.

dim handling issues

I did some testing of dim handling using mse from main in a jupyter notebook.

Here is some code and problems I found.

fcst = xr.DataArray(
    data=[[1., 2, 3], [2, 3, 4]],
    dims=['stn', 'date'],
    coords=dict(stn=[101, 102], date=['2022-01-01', '2022-01-02', '2022-01-03'])
)
obs = xr.DataArray(
    data=[[0., 2, 4], [.5, 2.2, 3.5]],
    dims=['source', 'date'],
    coords=dict(source=['a', 'b'], date=['2022-01-01', '2022-01-02', '2022-01-03'])
)

mse(fcst, obs, reduce_dims=[]) returns a single value (i.e. reduces all dimensions). It would be preferable if it reduced no dimensions.
mse(fcst, obs, preserve_dims=['source', 'date', 'stn']) returns a single value (i.e. reduces all dimensions) rather than an array with all dimensions preserved. Same as mse(fcst, fcst, preserve_dims=fcst.dims)
The docstring for mse says, for 'preserve_dims', that "the forecast and observed dimensions must match precisely". This can be removed as it works perfectly fine if they don't match (using usual xarray broadcasting) and I don't think it is even desirable that fcst and obs have matching dimensions.

Create v0.0.2 for PyPI update

Add missing dask to runtime dependencies

Tasks

Add dask (optional: pinned version) to core dependencies

Update gather_dimensions to handle when both reduce_dims and preserve_dims are None

Currently utils.gather_dimensions doesn't handle the case when both reduce_dims and preserve_dims are both None.

In this case it should return set(all_dims)

Migrate pylintrc to pyproject.toml

The configuration settings in pylintrc should be moved into pyproject.toml in order to adopt a consistent approach and also reduce the number of configuration files at the top level

nci / scores Goto Github PK

scores's Introduction

Scores: Verification and Evaluation for Forecasts and Models

Overview

Contributing

Using This Package

Finding, Downloading and Working With Data

Acknowledging This Work

scores's People

Contributors

Stargazers

Watchers

Forkers

scores's Issues

Tasks

Tasks

Recommend Projects

Recommend Topics

Recommend Org