jacobgil / confidenceinterval

The long missing library for python confidence intervals

License: MIT License

Python 100.00%

data-science machine-learning metrics statistics

confidenceinterval's Issues

Add the most basic metrics such as precision, specificity, and sensitivity, and improve the docs.

The metrics named with three-letter acronyms are cryptic. It would help to add a basic description of each method to its docstring and to spell out the full name of the metric there.
I may know what FPR is, but I had to google some of the others.

Also, I was not able to find specificity, sensitivity, or precision, which I consider the most wanted metrics.

Thanks for your work anyway; I believe you are doing a great job.
If I knew how bootstrapping works I would offer to contribute, but I don't have much time to study it right now, so maybe later.

Implementing Stratified Sampling

Aloha @jacobgil,

Concern:

Is it possible to implement stratified sampling for use in the bootstrapping process? sklearn.utils.resample has an extra parameter, stratify, which takes an array with one label per sample and resamples the data so that the class proportions in that array are preserved. The current method bootstraps indices without regard to the class of each sample.

Attempt at solution:

from scipy.stats import bootstrap
import numpy as np
from typing import List, Callable, Optional, Tuple
from sklearn.utils import resample

from dataclasses import dataclass
@dataclass
class BootstrapResult:
    bootstrap_distribution: np.ndarray

bootstrap_methods = [
    'bootstrap_bca',
    'bootstrap_percentile',
    'bootstrap_basic']


class BootstrapParams:
    n_resamples: int
    random_state: Optional[np.random.RandomState]


def bootstrap_ci(y_true: List[int],
                 y_pred: List[int],
                 metric: Callable,
                 confidence_level: float = 0.95,
                 n_resamples: int = 9999,
                 method: str = 'bootstrap_bca',
                 random_state: Optional[np.random.RandomState] = None,
                 strata: Optional[List[int]] = None) -> Tuple[float, Tuple[float, float]]:

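    def statistic(*indices):
        # scipy passes the resampled index array; take the first (only) sample.
        indices = np.array(indices)[0, :]

        try:
            return metric(np.array(y_true)[indices], np.array(y_pred)[indices])
        except Exception:
            # Debug aid: show which resample broke the metric (e.g. a single class).
            # Falls through and returns None, as in the original attempt.
            print('I failed lol', indices, np.unique(np.array(y_true)[indices]))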

    assert method in bootstrap_methods, f'Bootstrap ci method {method} not in {bootstrap_methods}'

    indices = (np.arange(len(y_true)), )


    bootstrap_distribution = [metric(*resample(y_true, y_pred, stratify=y_true))
                              for _ in range(n_resamples)]

    bootstrap_res_test = BootstrapResult(bootstrap_distribution=np.array(bootstrap_distribution)) 
    
    #print(bootstrap_res)

    bootstrap_res = bootstrap(indices,
                              statistic=statistic,
                              n_resamples=0,
                              confidence_level=confidence_level,
                              method=method.split('bootstrap_')[1],
                              bootstrap_result=bootstrap_res_test,
                              random_state=random_state)

    #print(bootstrap_res.bootstrap_distribution)
    
    np.testing.assert_equal(bootstrap_res.bootstrap_distribution, bootstrap_res_test.bootstrap_distribution)

    result = metric(y_true, y_pred)
    ci = bootstrap_res.confidence_interval.low, bootstrap_res.confidence_interval.high
    return result, ci

The main idea I tried was to use resample with stratify=y_true and pass that bootstrapped distribution into scipy.stats.bootstrap. This fails when the bootstrap method is anything other than "percentile", because the bootstrap function then calls statistic to evaluate the confidence limits.
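For context on why the non-percentile methods need the statistic: as far as I understand, BCa estimates its acceleration term from leave-one-out (jackknife) evaluations on the original data, so a precomputed bootstrap distribution alone is not enough. Below is a minimal pure-Python sketch of that jackknife step. The helper names jackknife_acceleration and counting_mean are made up for illustration; this is not scipy's internal code.

```python
from statistics import mean
from typing import Callable, List, Sequence


def jackknife_acceleration(data: Sequence[float],
                           statistic: Callable[[List[float]], float]) -> float:
    """Estimate the BCa acceleration from leave-one-out replicates."""
    data = list(data)
    # One call to `statistic` per observation: this is the part that cannot
    # be recovered from a bootstrap distribution computed elsewhere.
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(len(data))]
    centre = mean(loo)
    num = sum((centre - t) ** 3 for t in loo)
    den = 6.0 * sum((centre - t) ** 2 for t in loo) ** 1.5
    # If every replicate is identical the ratio is undefined; treat it as 0 here.
    return num / den if den > 0 else 0.0


calls = []


def counting_mean(values: List[float]) -> float:
    calls.append(len(values))  # record each evaluation of the statistic
    return mean(values)


acceleration = jackknife_acceleration([1, 2, 3, 4], counting_mean)
print(len(calls), acceleration)
```

With four observations the statistic is evaluated four times, each on three points, which is the kind of call the attempt above runs into.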

A simpler example for testing when statistic is called is the following:

import numpy as np
from scipy.stats import bootstrap
from dataclasses import dataclass

@dataclass
class BootstrapResult:
    bootstrap_distribution: np.ndarray

def noisy_mean(arr):
    print("HI", arr)
    return np.mean(arr)

bootstrap(([1,2,3,4],), noisy_mean, n_resamples=0, #method='percentile', 
          bootstrap_result=BootstrapResult(bootstrap_distribution=np.array([5,6,7,8,9])))

Context:
I would like to use this package for multi-class AUROC. However, I could not find methods that compute analytical confidence intervals for the one-vs-rest and one-vs-one cases, so I would use bootstrapping to compute the confidence interval. Sometimes the bootstrap randomly selects a resample of y_true in which every sample has the same class. This happens more frequently with imbalanced datasets (which are common in healthcare). AUROC is not defined in that case, so my code throws an error. Stratified bootstrapping, where the proportion of classes stays the same after resampling, appears to avoid this issue because every resample contains more than one class. Hence, I would like to introduce this feature, but I am having difficulty constructing the solution.
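If the percentile method is acceptable, the stratified idea can be sketched without scipy at all. The following is a minimal pure-Python sketch, not the package's implementation: stratified_percentile_ci and accuracy are hypothetical names, the percentile lookup is a crude nearest-rank rule, and accuracy stands in for any metric with a (y_true, y_pred) signature.

```python
import random
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple


def stratified_percentile_ci(y_true: Sequence[int],
                             y_pred: Sequence[float],
                             metric: Callable[[List[int], List[float]], float],
                             confidence_level: float = 0.95,
                             n_resamples: int = 2000,
                             seed: int = 0) -> Tuple[float, Tuple[float, float]]:
    rng = random.Random(seed)
    # Group sample indices by class so each class is resampled separately.
    by_class = defaultdict(list)
    for i, label in enumerate(y_true):
        by_class[label].append(i)

    replicates = []
    for _ in range(n_resamples):
        idx = []
        for members in by_class.values():
            # Draw with replacement, keeping this class's original count,
            # so every resample contains every class.
            idx.extend(rng.choices(members, k=len(members)))
        replicates.append(metric([y_true[i] for i in idx],
                                 [y_pred[i] for i in idx]))

    replicates.sort()
    alpha = (1.0 - confidence_level) / 2.0
    low = replicates[int(alpha * (n_resamples - 1))]
    high = replicates[int(round((1.0 - alpha) * (n_resamples - 1)))]
    return metric(list(y_true), list(y_pred)), (low, high)


def accuracy(t: List[int], p: List[float]) -> float:
    return sum(a == b for a, b in zip(t, p)) / len(t)


# Imbalanced toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 90 + [1] * 5 + [1] * 4 + [0]
point, (low, high) = stratified_percentile_ci(y_true, y_pred, accuracy)
print(point, low, high)
```

Because each class is resampled on its own, a resample can never collapse to a single class, which is the failure described above. The trade-off is that class counts are held fixed, so the interval reflects variability conditional on the observed class balance.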

Thank you for this fantastic package. It is very helpful and I believe it to be a new gold standard for ML evaluation.

Validation

Great work!
How can I trust the confidence intervals it returns? Did you do any kind of validation, for example comparing the results with the standard R implementations?

It would be good to have a (small) test report for some examples, in case you are working in a regulated environment.
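One lightweight way to start such a report, sketched here under the assumption that a closed-form Wilson score interval is an acceptable reference for proportion-type metrics, is to compute the reference by hand and compare it with whatever the library returns for the same counts. The function name wilson_interval is hypothetical; to my knowledge R's prop.test with correct = FALSE reports the same interval, which would give a second independent check.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, n: int,
                    confidence_level: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence_level) / 2.0)
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


low, high = wilson_interval(5, 10)
print(f"{low:.4f} {high:.4f}")  # prints "0.2366 0.7634"
```

A test report could tabulate a handful of such reference values next to the package's output and flag any difference beyond a stated tolerance.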

recall_score_bootstrap does not match tpr_score_bootstrap

Sensitivity, also known as the true positive rate, should be calculated consistently across the library.
I understand that bootstrap methods will produce slight differences in the confidence intervals, but not an inconsistency in the point estimate like the one in this minimal working example:

from confidenceinterval.takahashi_methods import recall_score_bootstrap, precision_score_bootstrap
from confidenceinterval.binary_metrics import tnr_score_bootstrap, ppv_score_bootstrap, tpr_score_bootstrap
from numpy.testing import assert_allclose, assert_almost_equal


def get_samples_based_on_tfpn(tp, tn, fp, fn) -> tuple[list[int], list[int]]:
    ground_truth = [1] * tp + [0] * tn + [1] * fn + [0] * fp
    predictions = [1] * tp + [0] * tn + [0] * fn + [1] * fp
    return ground_truth, predictions

tp_, tn_, fp_, fn_ = 679, 1366, 69, 69

y_true, y_pred = get_samples_based_on_tfpn(tp_, tn_, fp_, fn_)

sensitivity_, sensitivity_ci_ = recall_score_bootstrap(y_true=y_true, y_pred=y_pred, confidence_level=0.95, method='bootstrap_bca')
sensitivity, sensitivity_ci = tpr_score_bootstrap(y_true=y_true, y_pred=y_pred, confidence_level=0.95, method='bootstrap_bca')

assert_almost_equal(sensitivity, sensitivity_, decimal=3, err_msg=f"Sensitivity: {sensitivity} != {sensitivity_}")

AssertionError: 
Arrays are not almost equal to 3 decimals
Sensitivity: 0.9077540105738298 != 0.9367842418689877
 ACTUAL: 0.9077540105738298
 DESIRED: 0.9367842418689877

I have looked into the source code and identified several inconsistencies in the docstrings, where the terms "sensitivity" and "specificity" are mixed arbitrarily. This suggests code was copy-pasted without review, which may be where the error originates. I have not pinpointed the exact cause, though.
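A quick check on the counts above suggests one possible explanation, though it is not confirmed against the library source: 0.9368 is what micro-averaged recall (which equals accuracy) gives for these counts, while 0.9078 is the recall of the positive class alone. That would be consistent with recall_score_bootstrap defaulting to a multi-class averaging mode rather than binary recall; treat that as an assumption. The arithmetic, in plain Python:

```python
tp, tn, fp, fn = 679, 1366, 69, 69

# Recall of the positive class only (what TPR / sensitivity means).
binary_recall = tp / (tp + fn)

# Micro-averaged recall pools both classes: each correctly predicted sample
# is a true positive for its own class, so this equals accuracy.
micro_recall = (tp + tn) / (tp + tn + fp + fn)

# Macro-averaged recall: unweighted mean of the per-class recalls.
macro_recall = 0.5 * (tp / (tp + fn) + tn / (tn + fp))

print(f"binary={binary_recall:.4f} micro={micro_recall:.4f} macro={macro_recall:.4f}")
```

The binary and micro values match the two numbers in the assertion error to four decimals, so comparing the averaging defaults of the two functions seems like a reasonable first place to look.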

Handling correctly binary classification with default parameters

The following code runs on the data in predictions.csv and uses 3 methods for a recall evaluation:

recall_score from sklearn.metrics
recall_score from confidenceinterval
Evaluation of recall by definition from a confusion matrix.

The results are different between sklearn and confidenceinterval. Is there any explanation for this effect?

import pandas as pd
import confidenceinterval
from sklearn.metrics import recall_score, confusion_matrix
df = pd.read_csv('predictions.csv')
y_true = df['true'].values.astype(bool)
y_pred = (df['pred'].values > 0.5).astype(bool)
# using confidenceinterval
re1 = confidenceinterval.recall_score(y_true, y_pred)[0]
# using sklearn
re2 = recall_score(y_true, y_pred)
# direct calculation from confusion matrix
conf_mat = confusion_matrix(y_true, y_pred)
re3 = conf_mat[1, 1] / (conf_mat[1, 1] + conf_mat[1, 0])
print(f'{re1:.4f}, {re2:.4f}, {re3:.4f}')

Results are: 0.7789, 0.7820, 0.7820

Confidence Interval for Macro-F1 is always [0,1]

Hi,

Thanks for a great library! How come the macro-f1 has been hardcoded with a CI of [0,1] in this line?

confidenceinterval/confidenceinterval/metrics.py

Lines 527 to 528 in 7b47bfd

    if compute_ci:
        return f1, [0, 1]

I believe we should use the equation from the paper to first compute the variance and then derive the CI from it. Thanks!

Trouble with Unpack

When running this line from binary_metrics.py

from typing_extensions import Unpack

I get the following error:

ImportError: cannot import name 'Unpack' from 'typing_extensions' (/opt/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/typing_extensions.py)

I wonder whether anyone else is getting this.

I installed confidenceinterval with pip.
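The path in the traceback points at a single-file typing_extensions.py, which usually indicates an older release. As far as I know, Unpack was only added to typing_extensions in version 4.1.0, so upgrading that package (pip install --upgrade typing_extensions) is the first thing to try; treat the exact version number as an assumption. A small diagnostic sketch, with hypothetical helper names unpack_source and installed_version, that reports where Unpack can be imported from in the current environment:

```python
import sys
from importlib import metadata


def unpack_source() -> str:
    """Return the module that provides Unpack, or 'missing'."""
    try:
        from typing import Unpack  # noqa: F401  (standard library on Python 3.11+)
        return "typing"
    except ImportError:
        pass
    try:
        from typing_extensions import Unpack  # noqa: F401
        return "typing_extensions"
    except ImportError:
        return "missing"


def installed_version(package: str) -> str:
    """Return the installed version string, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


source = unpack_source()
print(sys.version_info[:2], installed_version("typing_extensions"), source)
```

If this prints "missing" alongside an old typing_extensions version, the import error above is an environment issue rather than a bug in binary_metrics.py.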
