raamana / confounds

Conquering confounds and covariates: methods, library and guidance

Home Page: https://raamana.github.io/confounds

License: Apache License 2.0


confound covariates machine-learning cross-validation scikit-learn statistics regression classification neuroimaging neuroscience

confounds's Introduction

Conquering confounds and covariates in machine learning

News

Hackathon folks: if you are coming here from the hackathon, please see ohbm/hackathon2021#34 for some contribution ideas.
The previous slides for the OHBM Hackathon and Open Science Room are here: https://crossinvalidation.com/2020/03/04/conquering-confounds-and-covariates-in-machine-learning/

Vision / Goals

The high-level goal of this package is to develop a high-quality library to conquer confounds and covariates in ML applications. By conquering, we mean methods and tools to

visualize and establish the presence of confounds (e.g. quantifying confound-to-target relationships),
offer solutions to handle them appropriately via correction or removal etc, and
analyze the effect of the deconfounding methods in the processed data (e.g. ability to check if they worked at all, or if they introduced new or unwanted biases etc).

Documentation

https://raamana.github.io/confounds

Methods

Available:

Residualize (e.g. via regression)
Augment (include confounds as predictors)
Some utils
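
To make the difference between the two available strategies concrete, here is a minimal NumPy sketch. It is not the library's implementation: the function names `residualize` and `augment` and the helper `_design` are hypothetical, and residualization is assumed to use an ordinary least-squares fit with an intercept. The regression is estimated on the training data only and then applied to whichever data is passed in.

```python
import numpy as np


def _design(confounds):
    """Design matrix: an intercept column followed by the confounds."""
    confounds = np.asarray(confounds, dtype=float).reshape(len(confounds), -1)
    return np.column_stack([np.ones(len(confounds)), confounds])


def residualize(train_features, train_confounds, features, confounds):
    """Subtract the part of `features` that a linear model predicts from confounds.

    The model is fitted on the training data only (hypothetical helper).
    """
    coef, *_ = np.linalg.lstsq(_design(train_confounds),
                               np.asarray(train_features, dtype=float),
                               rcond=None)
    return np.asarray(features, dtype=float) - _design(confounds) @ coef


def augment(features, confounds):
    """Append the confounds to the features as extra predictor columns."""
    return np.column_stack([features, confounds])
```

Residualizing keeps the number of feature columns unchanged, whereas augmenting adds one column per confound and leaves it to the downstream model to account for them.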

To be added:

Harmonize (correct batch effects via rescaling or normalization etc)
Stratify (sub- or re-sampling procedures to minimize confounding)
Full set of utilities (Goals 1 and 3)
Reweight (based on propensity scores as in IPW, or based on confounds)
Estimate propensity scores
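
As an illustration of the planned reweighting, the following sketch computes inverse probability weights (IPW) from propensity scores that are assumed to be estimated already. The function name and the `clip` parameter are hypothetical, and group membership is assumed to be binary (1 or 0).

```python
import numpy as np


def inverse_probability_weights(group, propensity, clip=1e-3):
    """IPW weights: 1/p for members of group 1, 1/(1-p) for members of group 0.

    `clip` (an assumption of this sketch) keeps propensities away from 0 and 1
    so that no single sample receives an unbounded weight.
    """
    group = np.asarray(group, dtype=float)
    propensity = np.clip(np.asarray(propensity, dtype=float), clip, 1.0 - clip)
    return group / propensity + (1.0 - group) / (1.0 - propensity)
```

The resulting weights could be passed as `sample_weight` to estimators that accept it.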

In a more schematic way:

Resources

Useful resources (papers, presentations, lectures) related to the problems of confounding can be found here: https://github.com/raamana/confounds/blob/master/docs/references_confounds.rst

Citation

If you found any part of confounds useful in your research, directly or indirectly, I'd appreciate it if you could cite the following:

Pradeep Reddy Raamana (2020), "Conquering confounds and covariates in machine learning with the python library confounds", Version 0.1.1, Zenodo. http://doi.org/10.5281/zenodo.3701528

Contributors are most welcome.

Your contributions of all kinds will be greatly appreciated. Learn how to contribute to this repo here.

All contributors making non-trivial contributions will

be publicly and clearly acknowledged on the authors page
become an author on the [software] paper to be published once it is ready.


confounds's Issues

better validation of inputs to Deconfounders

Issue #19 reminds me that some users can be confused because the code leaves the second argument to .fit() and .transform() optional, with y=None. The only reason we have y=None is to try to follow sklearn conventions and pass their tests, but given that we can't pass them anyway, we should tighten this up and make it an error not to supply the second [necessary] input argument.
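
A minimal sketch of what such a check could look like, assuming it is called at the top of both fit() and transform(). The function name `validate_features_and_confounds` is hypothetical and not part of the library; the exact error messages are placeholders.

```python
import numpy as np


def validate_features_and_confounds(features, confounds):
    """Return both inputs as 2D float arrays, or raise a descriptive ValueError."""
    if confounds is None:
        raise ValueError(
            "confounds must be supplied as the second argument "
            "to fit() and transform()")

    features = np.asarray(features, dtype=float)
    confounds = np.asarray(confounds, dtype=float)

    if features.ndim != 2:
        raise ValueError("features must be 2D: (n_samples, n_features)")
    if confounds.ndim == 1:
        # a single confound given as a flat vector becomes one column
        confounds = confounds.reshape(-1, 1)
    if features.shape[0] != confounds.shape[0]:
        raise ValueError(
            "features and confounds differ in number of samples: "
            f"{features.shape[0]} vs {confounds.shape[0]}")

    return features, confounds
```

Failing early with a message that names the missing argument is likely to be easier to act on than a downstream error about NaN or array shapes.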

cc @jrasero @jameschapman19

Implement metrics to quantify confound to target relationships

some ideas are correlation, R^2, delta R^2 etc
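
The three ideas above could be sketched with NumPy as follows. This is only one possible reading: the names `r_squared` and `confound_target_summary` are hypothetical, all fits are assumed to be ordinary least squares with an intercept, and delta R^2 is taken to mean the gain in R^2 when the features are added to a confounds-only model.

```python
import numpy as np


def r_squared(predictors, target):
    """R^2 of an ordinary least-squares fit of target on predictors."""
    target = np.asarray(target, dtype=float)
    predictors = np.asarray(predictors, dtype=float).reshape(len(target), -1)
    design = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    centered = target - target.mean()
    return 1.0 - (residual @ residual) / (centered @ centered)


def confound_target_summary(features, confounds, target):
    """Correlation per confound, R^2 of confounds alone, and delta R^2."""
    target = np.asarray(target, dtype=float)
    features = np.asarray(features, dtype=float).reshape(len(target), -1)
    confounds = np.asarray(confounds, dtype=float).reshape(len(target), -1)

    correlation = [float(np.corrcoef(confounds[:, j], target)[0, 1])
                   for j in range(confounds.shape[1])]
    r2_confounds = r_squared(confounds, target)
    r2_full = r_squared(np.column_stack([confounds, features]), target)

    return {"correlation": correlation,
            "r2_confounds": float(r2_confounds),
            "delta_r2": float(r2_full - r2_confounds)}
```

A high `r2_confounds` together with a small `delta_r2` would suggest that the confounds explain most of what the features appear to predict.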

drop-in replacements for cross_val_predict and cross_val_score etc

Pradeep,

could something like this be of interest for the library?

The idea would be to create a class whose fit and predict methods encapsulate both the deconfounding step and the estimator.

Below is a skeleton example. It deconfounds only the input data.

Analogous cross_val_predict and cross_val_score functions could be implemented as well.

from sklearn.base import clone

class SklearnWrapper():

    def __init__(self,
                 deconfounder,
                 estimator):

        self.deconfounder = deconfounder
        self.estimator = estimator

    def fit(self,
            input_data,
            target_data,
            confounders,
            sample_weight=None):

        # clone input arguments
        deconfounder = clone(self.deconfounder)
        estimator = clone(self.estimator)

        # Deconfound input data
        deconf_input = deconfounder.fit_transform(input_data, confounders)
        self.deconfounder_ = deconfounder

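        # Fit deconfounded input data; pass sample_weight by keyword, and only
        # when it is given, since not every estimator accepts it
        fit_params = {} if sample_weight is None else {'sample_weight': sample_weight}
        estimator.fit(deconf_input, target_data, **fit_params)
        self.estimator_ = estimator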

        return self

    def predict(self,
                input_data,
                confounders):

        deconf_input = self.deconfounder_.transform(input_data, confounders)

        return self.estimator_.predict(deconf_input)
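
For the cross_val_predict idea, a rough self-contained sketch could look like the following. Everything here is an assumption made for illustration: `LinearResidualizer` and `LeastSquares` are hypothetical stand-ins for a deconfounder and an estimator, `copy.deepcopy` stands in for sklearn's `clone`, and the folds are a plain shuffled split rather than one of sklearn's splitters. The key point is that the deconfounder is refitted on each training fold, so nothing from the test fold leaks into it.

```python
import copy

import numpy as np


class LinearResidualizer:
    """Hypothetical stand-in for a deconfounder with fit(X, C) / transform(X, C)."""

    def fit(self, features, confounds):
        design = np.column_stack([np.ones(len(confounds)), confounds])
        self.coef_, *_ = np.linalg.lstsq(design, features, rcond=None)
        return self

    def transform(self, features, confounds):
        design = np.column_stack([np.ones(len(confounds)), confounds])
        return features - design @ self.coef_


class LeastSquares:
    """Hypothetical minimal estimator with fit(X, y) / predict(X)."""

    def fit(self, features, target):
        design = np.column_stack([np.ones(len(features)), features])
        self.coef_, *_ = np.linalg.lstsq(design, target, rcond=None)
        return self

    def predict(self, features):
        design = np.column_stack([np.ones(len(features)), features])
        return design @ self.coef_


def cross_val_predict_deconfounded(deconfounder, estimator, features, target,
                                   confounds, n_splits=5, seed=0):
    """Out-of-fold predictions; the deconfounder is refitted on each training fold."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    confounds = np.asarray(confounds, dtype=float).reshape(len(target), -1)

    order = np.random.default_rng(seed).permutation(len(target))
    predictions = np.empty(len(target))

    for test_idx in np.array_split(order, n_splits):
        train_idx = np.setdiff1d(order, test_idx)
        # fresh copies so that no fitted state carries over between folds
        fold_deconf = copy.deepcopy(deconfounder)
        fold_est = copy.deepcopy(estimator)

        fold_deconf.fit(features[train_idx], confounds[train_idx])
        train_X = fold_deconf.transform(features[train_idx], confounds[train_idx])
        test_X = fold_deconf.transform(features[test_idx], confounds[test_idx])

        fold_est.fit(train_X, target[train_idx])
        predictions[test_idx] = fold_est.predict(test_X)

    return predictions
```

A cross_val_score counterpart could apply a metric to each fold's predictions inside the same loop instead of collecting them.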

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

confounds version: 0.1.3
Python version: 3.10
Operating System: mac

I get this error even though the data is clean and has no NA or NaN values.

C_sample = graph_corr_1[confound_cols]
X_sample = graph_corr_1.drop(confound_cols, axis=1)

resid = Residualize()
resid.fit(C_sample)
graph_corr_2 = resid.transform(C_sample)
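
One way to narrow this down: the error message covers infinite and very large values as well as NaN, so a check that looks only for NaN would not flag them. The helper below is a hypothetical diagnostic, not part of confounds, that counts non-finite entries per column. It may also be worth noting that the other report in this listing calls fit and transform with two arguments (features first, then confounds), whereas the snippet above passes only one.

```python
import numpy as np


def report_non_finite(array, column_names=None):
    """Map column name -> number of NaN or infinite entries (clean columns omitted)."""
    values = np.asarray(array, dtype=float)
    if values.ndim == 1:
        values = values.reshape(-1, 1)
    if column_names is None:
        column_names = [str(i) for i in range(values.shape[1])]

    bad_counts = (~np.isfinite(values)).sum(axis=0)
    return {name: int(count)
            for name, count in zip(column_names, bad_counts)
            if count > 0}
```

Running it on both the feature table and the confound table, with their column names, would show whether either input contains values that the validation rejects.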

Performance score stratified by confound

utils.score_stratified_by_confound()

Helper to summarize the performance score (accuracy, MSE, MAE etc) for each
level or variant of confound. This is helpful to assess any bias towards a
particular value when confounds are categorical (such as site or gender). So
if the MSE (of target) for Females is much lower compared to Males, then it
may indicate a potential bias of the model towards Females (due to imbalance in
size?)
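
A minimal sketch of how this helper might work, assuming the metric is any callable taking true and predicted targets. The signature is a guess for illustration; the issue does not specify one.

```python
import numpy as np


def score_stratified_by_confound(true_target, predicted_target, confound, metric):
    """Return {confound level: metric computed on the samples at that level}."""
    true_target = np.asarray(true_target)
    predicted_target = np.asarray(predicted_target)
    confound = np.asarray(confound)

    scores = {}
    for level in np.unique(confound):
        mask = confound == level
        # .item() turns the NumPy scalar into a plain Python key
        scores[level.item()] = float(
            metric(true_target[mask], predicted_target[mask]))
    return scores
```

Reporting the number of samples per level alongside each score would help judge whether a gap reflects bias or simply a small group.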

Implement Propensity Score estimation

Add tests for Residualize() with non-linear models

Including causal discovery methods such as LiNGAM

Should we consider offering causal discovery based on LiNGAM?
For example, Yang and Suzuki [2] apply LiNGAM to recognize brain connectivity patterns in fMRI data.

References

[1] Python package for causal discovery based on LiNGAM: https://www.jmlr.org/papers/v24/21-0321.html
[2] Yang and Suzuki, The Functional LiNGAM

Add tutorial notebooks, with few example use-cases

the usage can be easily turned into a tutorial notebook: https://raamana.github.io/confounds/usage.html

we can add more depending on the utilities and helpers etc

Add tests for DummyDeconfounding() method

Error fitting Residualize

confounds version: 0.1.1
Python version: 3.9.7
Operating System: macOS 11.6

Description

I tried to run the example code with some dummy data, but get an error when I try to fit Residualize

What I Did

# Using the diabetes dataset as an example
import numpy as np  # needed for np.arange below
from sklearn import datasets

df = datasets.load_diabetes(as_frame=True)['data']
X = df[['bmi', 'age', 's1']].values # some predictors
y = df['s6'].values # the outcome variable
c = df['sex'].values # a confound - does not matter which

# Splitting into a training and a test set
from sklearn.model_selection import train_test_split

train_ind, test_ind = train_test_split(np.arange(0, len(y)), test_size=0.2)
train_X = X[train_ind, :]
train_y = y[train_ind]
train_C = c[train_ind]

test_X = X[test_ind, :]
test_y = y[test_ind]
test_C = c[test_ind]

# Fitting Residualize to remove the confound
from confounds import Residualize

resid = Residualize()
resid.fit(train_X, train_C)
deconf_train_X = resid.transform(train_X, train_C)

Error message:

TypeError: check_is_fitted() takes from 1 to 2 positional arguments but 3 were given
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/m0/mddm8pfx1vs3q52qvgx4mpxw0000gp/T/ipykernel_27134/3595471338.py in <module>
      1 resid = Residualize()
      2 resid.fit(train_X, train_C)
----> 3 deconf_train_X = resid.transform(train_X, train_C)

/opt/anaconda3/envs/brain_shadows/lib/python3.9/site-packages/confounds/base.py in transform(self, X, y)
    186         """Placeholder to pass sklearn conventions"""
    187 
--> 188         return self._transform(X, y)
    189 
    190 

/opt/anaconda3/envs/brain_shadows/lib/python3.9/site-packages/confounds/base.py in _transform(self, test_features, test_confounds)
    192         """Actual deconfounding of the test features"""
    193 
--> 194         check_is_fitted(self, 'model_', 'n_features_')
    195         test_features = check_array(test_features, accept_sparse=True)
    196 

TypeError: check_is_fitted() takes from 1 to 2 positional arguments but 3 were given

Comment

It looks like there is some incompatibility, but I'm not sure what package is causing the error. Any help would be greatly appreciated!
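
The message "takes from 1 to 2 positional arguments but 3 were given" suggests that the installed scikit-learn exposes check_is_fitted with only the estimator and one attributes argument as positional parameters, while confounds/base.py passes two attribute names positionally. The sketch below reproduces that situation with a stand-in function that mirrors this call signature. It is not scikit-learn's implementation, and the list-based call is a possible fix rather than one confirmed by the maintainers.

```python
def check_is_fitted(estimator, attributes=None, *, msg=None, all_or_any=all):
    """Stand-in with the same call signature; NOT scikit-learn's implementation."""
    if attributes is None:
        # simplified: any instance attribute ending in "_" counts as fitted state
        fitted = any(name.endswith("_") for name in vars(estimator))
    else:
        if isinstance(attributes, str):
            attributes = [attributes]
        fitted = all_or_any(hasattr(estimator, name) for name in attributes)
    if not fitted:
        raise RuntimeError(msg or "This estimator is not fitted yet.")


class FittedModel:
    """Hypothetical object carrying the two attributes named in the traceback."""

    def __init__(self):
        self.model_ = object()
        self.n_features_ = 3


fitted = FittedModel()

# The call style from the traceback: two attribute names as separate positionals
try:
    check_is_fitted(fitted, 'model_', 'n_features_')
    old_style_error = None
except TypeError as err:
    old_style_error = str(err)

# Possible fix: pass the attribute names together as one list
check_is_fitted(fitted, ['model_', 'n_features_'])
```

Until the call in base.py is changed, installing an older scikit-learn release whose check_is_fitted still accepts the extra positional argument may work around the error.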

Literature on confounding in general and types of bias that are not due to confounding
Recent methods in the functional genomics literature such as RUV-4 and SCmerge
Papers evaluating harmonization methods and whether or not they succeed at de-confounding.
https://pubmed.ncbi.nlm.nih.gov/22101192/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4679071/