
confidenceinterval's Introduction

Jacob's github stats

confidenceinterval's People

Contributors

jacobgil, zaijab


confidenceinterval's Issues

Add the most basic metrics such as precision, specificity, and sensitivity, and increase the docs.

The metrics with the random three-letter acronyms are all nice and cryptic, but I believe it would be beneficial to add a basic description of each method to the docstrings and to mention the full names of the methods.
I may know what an FPR is, but some of these metrics I had to google.

Also, I was not able to find metrics like specificity, sensitivity, or precision, which I consider the most commonly needed ones.

Thanks for your work anyway, I believe you are doing a great job.
If I knew how bootstrapping works, I would offer to contribute, but I don't have much time to study it right now, so maybe later.
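
To make the request concrete, this is roughly the usage I have in mind (a minimal sketch; it assumes the package's bootstrap_ci helper accepts an arbitrary metric callable, as the code quoted in the stratified-sampling issue below suggests, and the import path is a guess):

import numpy as np
from confidenceinterval.bootstrap import bootstrap_ci  # assumed import path

def specificity(y_true, y_pred):
    # specificity = TN / (TN + FP)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tn / (tn + fp)

y_true = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# point estimate plus a bootstrapped 95% confidence interval for specificity
spec, ci = bootstrap_ci(y_true, y_pred,
                        metric=specificity,
                        confidence_level=0.95,
                        method='bootstrap_bca')
print(spec, ci)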

Implementing Stratified Sampling

Aloha @jacobgil,

Concern:

Is it possible to implement stratified sampling for use in the bootstrapping process? In sklearn.utils.resample there is an extra parameter, stratify, which takes an array of the same shape as the data and samples the data in proportion to the stratify parameter. The current method bootstraps indices, which are not linked to the class of the data.
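
For example, with sklearn alone (just to illustrate the resample behaviour, not the package's current code):

import numpy as np
from sklearn.utils import resample

# an imbalanced toy dataset: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.random.randint(0, 2, size=100)  # placeholder predictions

# plain bootstrap: the number of positives varies from resample to resample
yt_plain, yp_plain = resample(y_true, y_pred, random_state=0)

# stratified bootstrap: the 90/10 class proportion is preserved in every resample
yt_strat, yp_strat = resample(y_true, y_pred, stratify=y_true, random_state=0)

print(yt_plain.sum(), yt_strat.sum())  # the stratified resample keeps the 10 positives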

Attempt at solution:

from scipy.stats import bootstrap
import numpy as np
from typing import List, Callable, Optional, Tuple
from sklearn.utils import resample

from dataclasses import dataclass
@dataclass
class BootstrapResult:
    bootstrap_distribution: np.ndarray

bootstrap_methods = [
    'bootstrap_bca',
    'bootstrap_percentile',
    'bootstrap_basic']


class BootstrapParams:
    n_resamples: int
    random_state: Optional[np.random.RandomState]


def bootstrap_ci(y_true: List[int],
                 y_pred: List[int],
                 metric: Callable,
                 confidence_level: float = 0.95,
                 n_resamples: int = 9999,
                 method: str = 'bootstrap_bca',
                 random_state: Optional[np.random.RandomState] = None,
                 strata: Optional[List[int]] = None) -> Tuple[float, Tuple[float, float]]:

    def statistic(*indices):
        # scipy passes the resampled indices; index back into the original arrays
        indices = np.array(indices)[0, :]
        try:
            return metric(np.array(y_true)[indices], np.array(y_pred)[indices])
        except Exception:
            # e.g. AUROC is undefined when a resample contains only one class
            print('metric failed for a resample with classes:',
                  np.unique(np.array(y_true)[indices]))
            return np.nan

    assert method in bootstrap_methods, f'Bootstrap ci method {method} not in {bootstrap_methods}'

    indices = (np.arange(len(y_true)), )

    # draw stratified resamples with sklearn and evaluate the metric on each one
    bootstrap_distribution = [metric(*resample(y_true, y_pred, stratify=y_true))
                              for _ in range(n_resamples)]

    bootstrap_res_test = BootstrapResult(
        bootstrap_distribution=np.array(bootstrap_distribution))

    # hand the precomputed distribution to scipy with n_resamples=0, so scipy
    # only computes the confidence limits from the supplied bootstrap_result
    bootstrap_res = bootstrap(indices,
                              statistic=statistic,
                              n_resamples=0,
                              confidence_level=confidence_level,
                              method=method.split('bootstrap_')[1],
                              bootstrap_result=bootstrap_res_test,
                              random_state=random_state)

    np.testing.assert_equal(bootstrap_res.bootstrap_distribution,
                            bootstrap_res_test.bootstrap_distribution)

    result = metric(y_true, y_pred)
    ci = bootstrap_res.confidence_interval.low, bootstrap_res.confidence_interval.high
    return result, ci

The main idea I tried was to use resample with stratify=y_true and feed the resulting bootstrap distribution into the scipy.stats.bootstrap function. This fails whenever the bootstrap method is not "percentile", because bootstrap calls statistic again while evaluating the confidence limits (for example, the BCa correction needs additional evaluations of the statistic).

A simpler example that shows when statistic is called:

from scipy.stats import bootstrap
from dataclasses import dataclass
import numpy as np

@dataclass
class BootstrapResult:
    bootstrap_distribution: np.ndarray

def noisy_mean(arr):
    print("HI", arr)
    return np.mean(arr)

bootstrap(([1,2,3,4],), noisy_mean, n_resamples=0, #method='percentile', 
          bootstrap_result=BootstrapResult(bootstrap_distribution=np.array([5,6,7,8,9])))

Context:
I would like to use this package for multi-class AUROC. However, there are no easy-to-find methods that compute analytical confidence intervals for the one-vs-rest and one-vs-one cases, so I would use bootstrapping to compute the confidence interval. Sometimes the bootstrapping method randomly selects a subset of y_true containing only a single class. This happens more frequently with imbalanced datasets (which are common in healthcare). The AUROC is not defined in that case, so my code throws an error. Using stratified bootstrapping (where the proportion of classes after resampling stays the same) avoids this issue, because every resample then contains more than one class. Hence, I would like to introduce this feature, but I am having difficulty actually constructing the solution.
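
As a concrete illustration of the failure mode (just sklearn, independent of this package):

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

rng = np.random.RandomState(0)
y_true = np.array([0] * 48 + [1] * 2)   # heavily imbalanced: 2 positives out of 50
y_score = rng.rand(50)

# AUROC is undefined when only one class is present
try:
    roc_auc_score([0, 0, 0, 0], [0.1, 0.4, 0.3, 0.8])
except ValueError as e:
    print(e)  # "Only one class present in y_true. ROC AUC score is not defined ..."

# with only 2 positives, a plain bootstrap resample drops them both fairly often,
# while a stratified resample always keeps the 48/2 class proportion
yt, ys = resample(y_true, y_score, stratify=y_true)
print(roc_auc_score(yt, ys))  # defined, since both classes are present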

Thank you for this fantastic package. It is very helpful and I believe it to be a new gold standard for ML evaluation.

Validation

Great work!
How can I trust the confidence intervals it returns - did you do any kind of validation, for example comparing the results with the standard R implementations?

It would be nice to have a (small) test report for some examples, in case you are working in a regulated environment.

recall_score_bootstrap does not match tpr_score_bootstrap

Sensitivity, a.k.a. the true positive rate, should be calculated consistently across the library.
I can understand that there will be slight differences when using bootstrap methods for calculating the confidence intervals, but not an inconsistency like the one in this minimal working example:

from confidenceinterval.takahashi_methods import recall_score_bootstrap, precision_score_bootstrap
from confidenceinterval.binary_metrics import tnr_score_bootstrap, ppv_score_bootstrap, tpr_score_bootstrap
from numpy.testing import assert_allclose, assert_almost_equal


def get_samples_based_on_tfpn(tp, tn, fp, fn) -> tuple[list[int], list[int]]:
    ground_truth = [1] * tp + [0] * tn + [1] * fn + [0] * fp
    predictions = [1] * tp + [0] * tn + [0] * fn + [1] * fp
    return ground_truth, predictions

tp_, tn_, fp_, fn_ = 679, 1366, 69, 69

y_true, y_pred = get_samples_based_on_tfpn(tp_, tn_, fp_, fn_)

sensitivity_, sensitivity_ci_ = recall_score_bootstrap(y_true=y_true, y_pred=y_pred, confidence_level=0.95, method='bootstrap_bca')
sensitivity, sensitivity_ci = tpr_score_bootstrap(y_true=y_true, y_pred=y_pred, confidence_level=0.95, method='bootstrap_bca')

assert_almost_equal(sensitivity, sensitivity_, decimal=3, err_msg=f"Sensitivity: {sensitivity} != {sensitivity_}")

The assertion fails with:

AssertionError:
Arrays are not almost equal to 3 decimals
Sensitivity: 0.9077540105738298 != 0.9367842418689877
 ACTUAL: 0.9077540105738298
 DESIRED: 0.9367842418689877

I have looked into the source code and identified several inconsistencies in the docstrings, where the terms "sensitivity" and "specificity" are mixed arbitrarily, which points at unchecked copy-pasting of code and may be where the error originates. I have no clue where the error actually lies, though.
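
For reference, recomputing the candidate definitions directly from the counts above (just an observation about which formulas the two numbers match, not a diagnosis of where the bug is):

tp, tn, fp, fn = 679, 1366, 69, 69

sensitivity = tp / (tp + fn)                     # TP / (TP + FN)
micro_recall = (tp + tn) / (tp + tn + fp + fn)   # micro-averaged recall (= accuracy here)

print(sensitivity)   # ~0.90775, matches the tpr_score_bootstrap value above
print(micro_recall)  # ~0.93678, matches the recall_score_bootstrap value above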

Handling correctly binary classification with default parameters

The following code runs on the data in predictions.csv and evaluates recall in three ways:

  • recall_score from sklearn.metrics
  • recall_score from confidenceinterval
  • Evaluation of recall by definition from the confusion matrix.

The results are different between sklearn and confidenceinterval. Is there any explanation for this effect?

import pandas as pd
import confidenceinterval
from sklearn.metrics import recall_score, confusion_matrix
df = pd.read_csv('predictions.csv')
y_true = df['true'].values.astype(bool)
y_pred = (df['pred'].values > 0.5).astype(bool)
# using confidenceinterval
re1 = confidenceinterval.recall_score(y_true, y_pred)[0]
# using sklearn
re2 = recall_score(y_true, y_pred)
# direct calculation from confusion matrix
conf_mat = confusion_matrix(y_true, y_pred)
re3 = conf_mat[1, 1] / (conf_mat[1, 1] + conf_mat[1, 0])
print(f'{re1:.4f}, {re2:.4f}, {re3:.4f}')

Results are: 0.7789, 0.7820, 0.7820
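
One quick check that might narrow this down (just a guess that a different default averaging mode is responsible; this reuses y_true and y_pred from the snippet above):

from sklearn.metrics import recall_score

# sklearn's recall_score defaults to average='binary'; if confidenceinterval's
# recall_score defaults to a different averaging mode, that alone could explain
# the gap between re1 and re2
for avg in ('binary', 'micro', 'macro', 'weighted'):
    print(avg, recall_score(y_true, y_pred, average=avg))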

Trouble with Unpack

When running this line from binary_metrics.py

from typing_extensions import Unpack

I get the following error:

ImportError: cannot import name 'Unpack' from 'typing_extensions' (/opt/anaconda3/envs/pytorch_env/lib/python3.8/site-packages/typing_extensions.py)

I wonder if anyone else is running into this.

I installed confidenceinterval with pip.
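
Upgrading typing_extensions (pip install --upgrade typing_extensions) should make Unpack importable, since it was added in a release newer than the one that environment ships. A defensive import along these lines would also keep the module importable on older versions (a sketch, not the package's actual code):

# fall back to Any when Unpack is unavailable (assumes Unpack is only used for
# static typing, so runtime behaviour does not change)
try:
    from typing_extensions import Unpack
except ImportError:
    from typing import Any as Unpack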
