jacobgil / confidenceinterval
The long missing library for python confidence intervals
License: MIT License
The metrics with random three-letter acronyms are all nice and cryptic, but I believe it would be beneficial to add a basic description of each method to its docstring and to mention the method's full name.
I may know what FPR is, but some of those metrics I had to google.
Also, I was not able to find metrics like specificity, sensitivity, or precision, which I consider the most wanted.
Thanks for your work anyway; I believe you are doing a great job.
If I knew how bootstrapping works, I would offer to contribute, but I don't have much time to study that right now, so maybe later.
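For readers who had to look the acronyms up as well, here is a minimal sketch of how the requested metrics reduce to ratios of confusion-matrix counts. It is not taken from the library: the function name `wilson_interval` and the counts are hypothetical, and the Wilson score interval is used only as one common analytic interval for a proportion.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, total: int,
                    confidence_level: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for the binomial proportion successes / total."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical confusion-matrix counts, for illustration only.
tp, tn, fp, fn = 80, 90, 10, 20

# Each metric is "numerator / denominator" over a subset of the counts.
metrics = {
    "sensitivity (TPR, recall)": (tp, tp + fn),
    "specificity (TNR)": (tn, tn + fp),
    "precision (PPV)": (tp, tp + fp),
    "false positive rate (FPR)": (fp, fp + tn),
}

for name, (k, n) in metrics.items():
    low, high = wilson_interval(k, n)
    print(f"{name}: {k / n:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Spelling out the full name next to each acronym, as in the dictionary keys above, is the kind of docstring note this issue asks for.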
Aloha @jacobgil,
Concern:
Is it possible to implement stratified sampling for use in the bootstrapping process? sklearn.utils.resample has an extra parameter, stratify, which takes an array with the same shape as the data and samples the data in proportion to it. The current method bootstraps indices that are not linked to the class of the data.
Attempt at solution:
from scipy.stats import bootstrap
import numpy as np
from typing import List, Callable, Optional, Tuple
from sklearn.utils import resample
from dataclasses import dataclass


@dataclass
class BootstrapResult:
    bootstrap_distribution: np.ndarray


bootstrap_methods = [
    'bootstrap_bca',
    'bootstrap_percentile',
    'bootstrap_basic']


class BootstrapParams:
    n_resamples: int
    random_state: Optional[np.random.RandomState]


def bootstrap_ci(y_true: List[int],
                 y_pred: List[int],
                 metric: Callable,
                 confidence_level: float = 0.95,
                 n_resamples: int = 9999,
                 method: str = 'bootstrap_bca',
                 random_state: Optional[np.random.RandomState] = None,
                 strata: Optional[List[int]] = None) -> Tuple[float, Tuple[float, float]]:

    def statistic(*indices):
        indices = np.array(indices)[0, :]
        #return metric(np.array(y_true)[indices], np.array(y_pred)[indices])
        try:
            return metric(np.array(y_true)[indices], np.array(y_pred)[indices])
        except:
            print('I failed lol', indices, np.unique(np.array(y_true)[indices]))
            pass

    assert method in bootstrap_methods, f'Bootstrap ci method {method} not in {bootstrap_methods}'

    indices = (np.arange(len(y_true)), )
    # Stratified bootstrap distribution, built outside scipy.stats.bootstrap
    bootstrap_distribution = [metric(*resample(y_true, y_pred, stratify=y_true))
                              for _ in range(n_resamples)]
    bootstrap_res_test = BootstrapResult(bootstrap_distribution=np.array(bootstrap_distribution))
    #print(bootstrap_res)
    # n_resamples=0 asks scipy to reuse the distribution passed in bootstrap_result
    bootstrap_res = bootstrap(indices,
                              statistic=statistic,
                              n_resamples=0,
                              confidence_level=confidence_level,
                              method=method.split('bootstrap_')[1],
                              bootstrap_result=bootstrap_res_test,
                              random_state=random_state)
    #print(bootstrap_res.bootstrap_distribution)
    np.testing.assert_equal(bootstrap_res.bootstrap_distribution, bootstrap_res_test.bootstrap_distribution)

    result = metric(y_true, y_pred)
    ci = bootstrap_res.confidence_interval.low, bootstrap_res.confidence_interval.high
    return result, ci
The main idea I tried was to use resample with stratify=y_true and feed that bootstrapped distribution into the scipy.stats.bootstrap function. This fails when the bootstrap method is not "percentile", because bootstrap then calls statistic to evaluate the confidence limits.
A simpler example to test when statistic is called is the following:
import numpy as np
from scipy.stats import bootstrap
from dataclasses import dataclass


@dataclass
class BootstrapResult:
    bootstrap_distribution: np.ndarray


def noisy_mean(arr):
    # Prints whenever scipy calls the statistic
    print("HI", arr)
    return np.mean(arr)


bootstrap(([1,2,3,4],), noisy_mean, n_resamples=0, #method='percentile',
          bootstrap_result=BootstrapResult(bootstrap_distribution=np.array([5,6,7,8,9])))
Context:
I would like to use this package for multi-class AUROC. However, there are no easy-to-find methods that compute analytical confidence intervals for the one-vs-rest and one-vs-one cases, so I would use bootstrapping to compute the confidence interval. Sometimes the bootstrap randomly selects a subset of y_true in which every sample belongs to the same class. This happens more frequently with imbalanced datasets (which are common in healthcare). AUROC is not defined in that case, so my code throws an error. Stratified bootstrapping, where the proportion of classes stays the same after resampling, seems to avoid this issue because every sample then contains more than one class. Hence, I would like to introduce this feature. However, I am having difficulty actually constructing the solution.
Thank you for this fantastic package. It is very helpful and I believe it to be a new gold standard for ML evaluation.
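The stratified resampling described in the context above can be sketched without scipy, under some loudly stated assumptions: it covers only the percentile method (BCa additionally needs the statistic on the original data and a jackknife, which is what the attempt above runs into), it uses a binary AUROC for brevity instead of the multi-class case, and every function name and data value here is hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple


def stratified_resample_indices(labels: Sequence[int], rng: random.Random) -> List[int]:
    """Draw indices with replacement inside each class, so class counts are preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    sampled: List[int] = []
    for members in by_class.values():
        sampled.extend(rng.choices(members, k=len(members)))
    return sampled


def binary_auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_percentile_ci(y_true: Sequence[int], y_score: Sequence[float],
                             metric: Callable, confidence_level: float = 0.95,
                             n_resamples: int = 1000,
                             seed: int = 0) -> Tuple[float, Tuple[float, float]]:
    """Percentile bootstrap interval where each resample keeps the class counts."""
    rng = random.Random(seed)
    distribution = []
    for _ in range(n_resamples):
        idx = stratified_resample_indices(y_true, rng)
        distribution.append(metric([y_true[i] for i in idx], [y_score[i] for i in idx]))
    distribution.sort()
    alpha = (1 - confidence_level) / 2
    low = distribution[int(alpha * (n_resamples - 1))]
    high = distribution[int(round((1 - alpha) * (n_resamples - 1)))]
    return metric(y_true, y_score), (low, high)


# Hypothetical imbalanced data: 3 positives among 40 samples.
y_true = [1, 1, 1] + [0] * 37
y_score = [0.9, 0.6, 0.4] + [0.05 + 0.015 * i for i in range(37)]
auc, (low, high) = stratified_percentile_ci(y_true, y_score, binary_auroc)
print(f"AUROC = {auc:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Because every class is resampled separately, no resample can collapse to a single class, so the metric is always defined. Whether a stratified interval is statistically appropriate for a given study design is a separate question that this sketch does not answer.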
Great work!
How can I trust the confidence intervals it returns? Did you do any kind of validation, for example by comparing results with the standard R implementations?
It would be good to have a (small) test report for some examples, in case you are in a regulated environment.
Sensitivity, aka true positive rate, should be calculated consistently across the library.
I can understand that there will be slight differences when using bootstrap methods to calculate the confidence intervals, but not an inconsistency like the one in this minimal working example:
from confidenceinterval.takahashi_methods import recall_score_bootstrap, precision_score_bootstrap
from confidenceinterval.binary_metrics import tnr_score_bootstrap, ppv_score_bootstrap, tpr_score_bootstrap
from numpy.testing import assert_allclose, assert_almost_equal


def get_samples_based_on_tfpn(tp, tn, fp, fn) -> tuple[list[int], list[int]]:
    # Build label lists that reproduce the given confusion-matrix counts
    ground_truth = [1] * tp + [0] * tn + [1] * fn + [0] * fp
    predictions = [1] * tp + [0] * tn + [0] * fn + [1] * fp
    return ground_truth, predictions


tp_, tn_, fp_, fn_ = 679, 1366, 69, 69
y_true, y_pred = get_samples_based_on_tfpn(tp_, tn_, fp_, fn_)
sensitivity_, sensitivity_ci_ = recall_score_bootstrap(y_true=y_true, y_pred=y_pred, confidence_level=0.95, method='bootstrap_bca')
sensitivity, sensitivity_ci = tpr_score_bootstrap(y_true=y_true, y_pred=y_pred, confidence_level=0.95, method='bootstrap_bca')
assert_almost_equal(sensitivity, sensitivity_, decimal=3, err_msg=f"Sensitivity: {sensitivity} != {sensitivity_}")
AssertionError:
Arrays are not almost equal to 3 decimals
Sensitivity: 0.9077540105738298 != 0.9367842418689877
ACTUAL: 0.9077540105738298
DESIRED: 0.9367842418689877
I have looked into the source code and identified several inconsistencies in the docstrings, where the terms "sensitivity" and "specificity" are mixed arbitrarily. This points to unchecked copy-pasting of code, which is probably where the error originates, but I could not pinpoint the exact cause.
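One way to narrow this down is to recompute both reported numbers by hand from the counts in the example above. The arithmetic below shows that 0.9078 is the recall of class 1 alone, while 0.9368 equals the recall micro-averaged over both classes, which for single-label data is the accuracy. That the two library functions differ in their averaging is only an assumption here, not something the thread confirms; the helper name `class_counts` is hypothetical.

```python
from typing import List, Tuple


def get_samples_based_on_tfpn(tp: int, tn: int, fp: int, fn: int) -> Tuple[List[int], List[int]]:
    """Same construction as in the example above."""
    ground_truth = [1] * tp + [0] * tn + [1] * fn + [0] * fp
    predictions = [1] * tp + [0] * tn + [0] * fn + [1] * fp
    return ground_truth, predictions


def class_counts(y_true: List[int], y_pred: List[int], positive: int) -> Tuple[int, int]:
    """Return (true positives, false negatives) when `positive` is the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn


y_true, y_pred = get_samples_based_on_tfpn(679, 1366, 69, 69)

tp1, fn1 = class_counts(y_true, y_pred, positive=1)
tp0, fn0 = class_counts(y_true, y_pred, positive=0)

# Recall of class 1 only: the usual definition of sensitivity.
binary_recall = tp1 / (tp1 + fn1)

# Recall pooled over both classes (micro average).
micro_recall = (tp1 + tp0) / (tp1 + fn1 + tp0 + fn0)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"binary recall = {binary_recall:.6f}")
print(f"micro recall  = {micro_recall:.6f}")
print(f"accuracy      = {accuracy:.6f}")
```

If the averaging mode is indeed the cause, checking the default averaging argument of the recall function would be the next step.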
The following code runs on the data in predictions.csv and evaluates recall in three ways: recall_score from sklearn.metrics, recall_score from confidenceinterval, and a direct calculation from the confusion matrix.
The results differ between sklearn and confidenceinterval. Is there any explanation for this effect?
import pandas as pd
import confidenceinterval
from sklearn.metrics import recall_score, confusion_matrix

df = pd.read_csv('predictions.csv')
y_true = df['true'].values.astype(bool)
y_pred = (df['pred'].values > 0.5).astype(bool)

# using confidenceinterval
re1 = confidenceinterval.recall_score(y_true, y_pred)[0]

# using sklearn
re2 = recall_score(y_true, y_pred)

# direct calculation from confusion matrix
conf_mat = confusion_matrix(y_true, y_pred)
re3 = conf_mat[1, 1] / (conf_mat[1, 1] + conf_mat[1, 0])

print(f'{re1:.4f}, {re2:.4f}, {re3:.4f}')
Results are: 0.7789, 0.7820, 0.7820
Hi,
Thanks for a great library! How come the macro-f1 has been hardcoded with a CI of [0,1] in this line?
confidenceinterval/confidenceinterval/metrics.py
Lines 527 to 528 in 7b47bfd
I believe we should use the equation from the paper to first compute the variance and then compute the CI from that. Thanks!
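The second step of that suggestion, turning a variance estimate into an interval, can be sketched as a normal-approximation interval. This is only an illustration: the variance formula from the paper is not reproduced here, the function name `normal_ci_from_variance` is hypothetical, and the numbers are made up.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def normal_ci_from_variance(estimate: float, variance: float,
                            confidence_level: float = 0.95) -> Tuple[float, float]:
    """Normal-approximation interval estimate +/- z * sqrt(variance), clipped to [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    half_width = z * sqrt(variance)
    return max(0.0, estimate - half_width), min(1.0, estimate + half_width)


# Hypothetical macro-F1 and variance; in practice the variance would come
# from the paper's equation, which is not shown here.
macro_f1, variance = 0.80, 0.0004
low, high = normal_ci_from_variance(macro_f1, variance)
print(f"macro-F1 = {macro_f1:.3f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

Clipping keeps the interval inside the valid range of an F1 score, which is narrower than a fixed [0, 1] whenever the variance is small.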
When running this line from binary_metrics.py
from typing_extensions import Unpack
I get the following error:
ImportError: cannot import name 'Unpack' from 'typing_extensions' (/opt/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/typing_extensions.py)
I wonder whether anyone else is getting this.
I installed confidenceinterval with pip.
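A quick way to diagnose this is to check which installed module, if any, actually provides Unpack. As far as I know, Unpack was added to typing_extensions in release 4.1.0 and to the standard typing module in Python 3.11, so an older typing_extensions in a Python 3.8 environment would produce exactly this ImportError. The helper name `find_unpack_provider` is hypothetical.

```python
import importlib
from typing import Optional


def find_unpack_provider() -> Optional[str]:
    """Return the first module that exposes `Unpack`, or None if neither does."""
    for module_name in ("typing", "typing_extensions"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module is not installed at all; try the next candidate.
            continue
        if hasattr(module, "Unpack"):
            return module_name
    return None


provider = find_unpack_provider()
if provider is None:
    print("Unpack is unavailable; upgrading typing_extensions may help.")
else:
    print(f"Unpack is provided by {provider}")
```

If the helper returns None, running `pip install --upgrade typing_extensions` in the same environment is the usual remedy.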