This is a Python implementation of Expectation-Maximization (EM) optimization for the parameters of a nested mixture of Gaussians (a mixture of Gaussian mixtures), presented in our paper An Approach to Identifying and Quantifying Bias in Biomedical Data. If labeled data for the classes are available, they are used in the optimization.
Often, features are available for a large amount of data, but ground-truth labels are scarce. The semi-supervised setting, where most data lack accompanying labels, is therefore common in many applications.
In addition to being semi-supervised, the labeled data that are available may not be representative of the true underlying population of the classes represented by the labels.
Observed distributions ≠ True distributions
We’re interested in detecting and quantifying the level of bias in these labeled data in the semi-supervised learning setting. More precisely, consider a classification problem with two classes of interest, the positive and the negative class. We may have a large trove of unlabeled data, represented in the figure above by the black distribution, but we don’t know the underlying class labels of that data.
Ideally, we would have labels for each class that are drawn from the same underlying population, but this may not be the case.
Instead, what is available to us is some sample of labeled data, but due to labeling constraints, sampling bias, the availability of only some subpopulations, or other mechanisms, the distribution within each label class is not representative of the corresponding class distribution.
We call this systematic discrepancy between the labeled and unlabeled class distributions the bias in the labels. And we are interested in detecting and quantifying this bias without knowing the true unlabeled class distributions.
Quantifying the disagreement between the true class distributions and those in the labeled set is difficult in general because the parameters of both the labeled and unlabeled distributions are unknown.
We must make some assumptions in order to jointly model all of the unknowns in the problem.
We assume that each true underlying class distribution can be represented as a mixture of K Gaussians, and that the corresponding labeled class distribution can be represented as a mixture of those same K Gaussians but with different mixing proportions.
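This assumption can be illustrated with a small NumPy sketch (the parameter values below are made up for illustration): a single class whose true and labeled distributions share the same K = 2 Gaussian components but use different mixing proportions, so the labeled sample is systematically shifted relative to the true class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D class: a mixture of K = 2 shared Gaussian components.
mus = np.array([-1.0, 2.0])       # component means
sigmas = np.array([0.5, 1.0])     # component standard deviations

w_true = np.array([0.7, 0.3])     # mixing proportions of the true (unlabeled) class
w_labeled = np.array([0.2, 0.8])  # different proportions in the biased labeled sample

def sample_mixture(w, n):
    """Draw n points from a Gaussian mixture with component weights w."""
    comps = rng.choice(len(w), size=n, p=w)
    return rng.normal(mus[comps], sigmas[comps])

true_data = sample_mixture(w_true, 5000)
labeled_data = sample_mixture(w_labeled, 500)

# Same components, different proportions: the labeled sample's mean is
# pulled toward the over-represented component.
print(true_data.mean(), labeled_data.mean())
```

The gap between the two sample means is exactly the kind of systematic discrepancy between labeled and unlabeled class distributions that the model is designed to capture through the mixing proportions.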
The notebooks demo.ipynb and demo-2D.ipynb contain examples of learning the parameters in 1 and 2 dimensions.
To estimate model parameters from data:

```python
Kfit = [2, 2]  # number of components to use in parameter estimation for each class
alphas, w, w_l, sigmas, mus, lls = PU_nested_em_opt(unlabeled_data, [labeled_pos, labeled_neg],
                                                    Kfit, max_steps=5000)
```
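To make the fitted quantities concrete, here is a minimal sketch of evaluating the nested-mixture density they describe: the unlabeled density is the class-prior-weighted sum of the two class mixtures, each a K-component Gaussian mixture. The parameter names and shapes below are illustrative assumptions for a 1-D problem, not the exact return format of PU_nested_em_opt.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical fitted parameters for K = 2 components per class (illustrative only).
alpha = 0.4                                            # class prior of the positive class
mus = [np.array([-1.0, 1.5]), np.array([0.0, 3.0])]    # per-class component means
sigmas = [np.array([0.6, 0.9]), np.array([0.8, 0.5])]  # per-class component stds
w = [np.array([0.5, 0.5]), np.array([0.3, 0.7])]       # unlabeled mixing proportions

def mixture_pdf(x, w_c, mu_c, sigma_c):
    """Density of a Gaussian mixture evaluated at points x."""
    return sum(wk * norm.pdf(x, mk, sk) for wk, mk, sk in zip(w_c, mu_c, sigma_c))

def unlabeled_pdf(x):
    """Unlabeled density: class-prior-weighted sum of the two class mixtures."""
    return alpha * mixture_pdf(x, w[0], mus[0], sigmas[0]) \
        + (1 - alpha) * mixture_pdf(x, w[1], mus[1], sigmas[1])

# Sanity check: the density should integrate to approximately one.
xs = np.linspace(-6.0, 8.0, 2000)
mass = unlabeled_pdf(xs).sum() * (xs[1] - xs[0])
print(mass)
```

The labeled class densities have the same form, with the labeled mixing proportions (w_l above) substituted for the unlabeled ones; comparing the two sets of proportions for a class quantifies its bias.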
The MATLAB directory contains the MATLAB code used in the original paper. main.m contains an optimization example, for one initialization, on the dataset in dataset.mat.
To run with your own data, specify data matrices unlabeled, labeled_pos, and labeled_neg with samples as rows and features as columns. Specify the number of components with num_components and, optionally, a progress bar. Run the optimization with

```matlab
[alpha, negative_params, positive_params, w_labeled] = ...
    PNU_nested_em(unlabeled, labeled_pos, labeled_neg, num_components)
```
If you use this code for research, please cite our accompanying paper:
@inproceedings{depaoliskaluza2022bias,
title={An Approach to Identifying and Quantifying Bias in Biomedical Data},
author={De Paolis Kaluza, M. Clara and Jain, Shantanu and Radivojac, Predrag},
booktitle={Pacific Symposium on Biocomputing 2023: Kohala Coast, Hawaii, USA, 3--7 January 2023},
pages={311--322},
year={2022},
}