bootstrap's Introduction

Bootstrap

A library for bootstrapping statistics.

Features

While incomplete, the library already includes a number of features:

Bootstrap samples
Bootstrap matrices
Bootstrap statistics
- Provides SEM and confidence intervals for statistics
Jackknife samples and statistics
Two sample testing

Installation

python setup.py install

Usage

Here, we document some of the library's features using the University of Wisconsin breast cancer data set, loaded below through scikit-learn. For simplicity, only the first feature is examined.

import numpy as np
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

First, we will look at how the data are distributed.

import matplotlib.pyplot as plt
import seaborn as sns
plt.hist(data.data[:,0], bins=40)
plt.title('Measurements')

Next, we will draw 10,000 bootstrap samples to estimate the mean and its 95% confidence interval. Below, the mean of each bootstrapped sample is plotted, with the estimated mean and confidence interval shown.

results = bootstrap_statistic(data.data[:,0], func=np.mean, n_samples=10000)

# Make plot of bootstrapped mean
plt.hist(results.statistics, bins=40)
plt.title('Bootstrapped Means')
plt.xlabel('Mean')
plt.ylabel('Counts')
ax = plt.gca()
ax.axvline(x=results.ci[0], color='red', linestyle='dashed', linewidth=2)
ax.axvline(x=results.ci[1], color='red', linestyle='dashed', linewidth=2)
ax.axvline(x=results.statistic, color='black', linewidth=5)
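For readers who want to see the idea behind the call above, here is a minimal NumPy-only sketch of a percentile bootstrap for the mean. It is not the library's implementation: `bootstrap_mean_ci` is a hypothetical helper, the percentile interval is just one common way to form a bootstrap confidence interval, and the toy data are synthetic.

```python
import numpy as np

def bootstrap_mean_ci(values, n_samples=10000, alpha=0.05, seed=0):
    """Hypothetical helper: percentile-bootstrap estimate of the mean.

    Returns the sample mean, a (lower, upper) confidence interval,
    and the bootstrap standard error of the mean.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Each row holds the indices of one resample drawn with replacement
    idx = rng.integers(0, n, size=(n_samples, n))
    means = values[idx].mean(axis=1)
    # Percentile interval: cut alpha/2 from each tail of the bootstrap means
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # The spread of the bootstrap means approximates the SEM
    sem = means.std(ddof=1)
    return values.mean(), (lower, upper), sem

# Synthetic data, for illustration only
toy = np.random.default_rng(1).normal(loc=14.0, scale=3.5, size=500)
estimate, ci, sem = bootstrap_mean_ci(toy, n_samples=2000)
print(estimate, ci, sem)
```

The same loop works for any statistic: replace the row-wise mean with another function applied to each resample.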

An advantage of the bootstrap method is its adaptability. For example, you can bootstrap an estimate of the 95th percentile of the data.

def percentile(data):
    """returns 95th percentile of data"""
    return np.percentile(data, 95)
    
# Bootstrap the 95th percentile
results = bootstrap_statistic(data.data[:,0], func=percentile, n_samples=10000)

# Make plot of bootstrapped 95th percentile
plt.hist(results.statistics, bins=40)
plt.title('Bootstrapped 95th Percentiles')
plt.xlabel('95th Percentile')
plt.ylabel('Counts')
ax = plt.gca()
ax.axvline(x=results.ci[0], color='red', linestyle='dashed', linewidth=2)
ax.axvline(x=results.ci[1], color='red', linestyle='dashed', linewidth=2)
ax.axvline(x=results.statistic, color='black', linewidth=5)

Additionally, the library can perform two-sample testing. First, let's view the distribution of the same data, but broken up by tumor type.

# In scikit-learn's encoding, target 0 is malignant and target 1 is benign
benign = data.data[data.target == 1]
malignant = data.data[data.target == 0]

# Plot benign and malignant samples
plt.hist(benign[:,0], bins=30, alpha=0.5, label='benign')
plt.hist(malignant[:,0], bins=30, alpha=0.5, label='malignant')
plt.legend()
plt.xlabel('Measurement')
plt.ylabel('Counts')

It appears there is a difference between the two groups' distributions. The level of significance can be computed via the bootstrap method.

significance = two_sample_testing(benign[:, 0], malignant[:, 0],
                                  statistic_func=compare_means,
                                  n_samples=5000)
print(significance) # prints 0.0

Hmmm, with 5,000 random bootstrapped samples, not a single one had a difference of means as large as that of the observed samples.
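One way such a test can work is sketched below. This is an assumption about the general technique, not a description of how `two_sample_testing` is implemented: the two groups are pooled (which imposes the null hypothesis of no difference), resampled with replacement, and split back into groups of the original sizes. `two_sample_bootstrap_pvalue` is a hypothetical name and the data are synthetic.

```python
import numpy as np

def two_sample_bootstrap_pvalue(a, b, n_samples=5000, seed=0):
    """Hypothetical helper: two-sided bootstrap test for a difference of means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    # Pooling the groups enforces the null hypothesis of a shared distribution
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_samples):
        resample = rng.choice(pooled, size=len(pooled), replace=True)
        diff = abs(resample[:len(a)].mean() - resample[len(a):].mean())
        if diff >= observed:
            count += 1
    # Fraction of null resamples at least as extreme as the observed difference
    return count / n_samples

# Synthetic data, for illustration only
rng = np.random.default_rng(2)
same = two_sample_bootstrap_pvalue(rng.normal(0, 1, 200), rng.normal(0, 1, 200), n_samples=1000)
shifted = two_sample_bootstrap_pvalue(rng.normal(0, 1, 200), rng.normal(2, 1, 200), n_samples=1000)
print(same, shifted)
```

A result of 0.0 from this kind of procedure only says that none of the resamples was as extreme as the observation, so it is usually safer to report it as p < 1/n_samples than as an exact zero.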

What about a feature that is less predictive? Below, we look at feature 9.

plt.hist(benign[:,9], bins=30, alpha=0.5, label='benign')
plt.hist(malignant[:,9], bins=30, alpha=0.5, label='malignant')
plt.legend()
plt.xlabel('Measurement')
plt.ylabel('Counts')

If we then bootstrap the difference between the two means, we get a non-significant difference.

significance = two_sample_testing(malignant[:, 9], benign[:, 9],
                                  statistic_func=compare_means,
                                  n_samples=5000)
print(significance) # prints 0.387

bootstrap's Issues

Parametric sampling is extremely limited.

Currently only:

normal
uniform

Common parametric distributions worth adding:

Poisson
Logistic
Beta
Gamma

Any others?
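As a rough illustration of what supporting another distribution could involve, here is a sketch of a parametric bootstrap for Poisson data using NumPy alone. The function name and signature are hypothetical and do not reflect the library's existing parametric interface; NumPy's `Generator` also provides `gamma`, `beta`, and `logistic` samplers that could be slotted in the same way once the parameters are fitted.

```python
import numpy as np

def parametric_bootstrap_poisson(counts, func=np.mean, n_samples=2000, seed=0):
    """Hypothetical helper: parametric bootstrap assuming Poisson-distributed data."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    # The maximum-likelihood estimate of the Poisson rate is the sample mean
    lam_hat = counts.mean()
    # Simulate n_samples data sets of the original size from the fitted model
    sims = rng.poisson(lam_hat, size=(n_samples, len(counts)))
    # Apply the statistic to every simulated data set
    return np.array([func(row) for row in sims])

# Synthetic counts, for illustration only
counts = np.random.default_rng(3).poisson(4.0, size=300)
stats = parametric_bootstrap_poisson(counts)
print(stats.mean(), stats.std(ddof=1))
```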

Creating bootstrap and jackknife objects.

Hi there,

I love the idea of this package! One thing that I think might be necessary as the number of resampling methods in the package grows is to start to create a more object-oriented framework for the bootstrap, jackknife, etc.

For example, there could be an abstract base class called "Statistic" or "ResampledStatistic", that would represent the base of any of the resampling methods that might be added to this package. Then, there could be Bootstrap, Jackknife, etc... objects that inherit from this class.

Benefits of this approach include:

A unified interface for data inputs and functions.
A Statistic base class could include methods for dealing with pandas dataframes, ndarrays, matrices, and other special cases, which would be inherited by any of its subclasses, e.g. Bootstrap, Jackknife.
A series of specialized functions would be replaced with method calls on an object.

For example, let's say there is a matrix whose rows you would like to sample. Currently, the syntax is

bootstrap_matrixsample(data, axis=0)

But creating a Bootstrap object would lead to something like this.

bstrp = Bootstrap(data)
bstrp.sample(axis=0)

This would return a sample of the rows in the matrix. If the Statistic base class had code to determine that the data is a matrix, then the user would not have to worry about specifying the correct function for their problem. They would know that the bstrp.sample() functionality works regardless. This functionality is what makes pandas so powerful.

I know these differences seem trivial, but as the code base grows, this sort of framework will really pay dividends, at least in saving your wrists.

I'd be happy to create a new branch and play around with this idea.
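To make the proposal above concrete, here is a small sketch of what such a class hierarchy might look like. None of these classes exist in the package: `ResampledStatistic`, `Bootstrap`, and `Jackknife` are the hypothetical names from the discussion, and the `seed` argument is an extra assumption added for reproducibility.

```python
from abc import ABC, abstractmethod

import numpy as np

class ResampledStatistic(ABC):
    """Hypothetical base class shared by all resampling methods."""

    def __init__(self, data, seed=None):
        # np.asarray accepts lists and ndarrays, and also pandas objects
        self.data = np.asarray(data)
        self.rng = np.random.default_rng(seed)

    @abstractmethod
    def sample(self, axis=0):
        """Return resampled data along the given axis."""

class Bootstrap(ResampledStatistic):
    def sample(self, axis=0):
        # Draw as many slices as the data has along `axis`, with replacement
        n = self.data.shape[axis]
        idx = self.rng.integers(0, n, size=n)
        return np.take(self.data, idx, axis=axis)

class Jackknife(ResampledStatistic):
    def sample(self, axis=0):
        # One leave-one-out copy of the data per slice along `axis`
        n = self.data.shape[axis]
        return [np.delete(self.data, i, axis=axis) for i in range(n)]

matrix = np.arange(12).reshape(4, 3)
bstrp = Bootstrap(matrix, seed=0)
rows = bstrp.sample(axis=0)            # 4 rows drawn with replacement
jack = Jackknife(matrix).sample(axis=0)  # 4 matrices, each missing one row
print(rows.shape, len(jack), jack[0].shape)
```

Because both subclasses share one constructor, input handling lives in a single place, which is the main benefit the proposal argues for.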

christopherjenness / bootstrap