
R/medshift: Causal Mediation Analysis for Stochastic Interventions

Home Page: https://codex.nimahejazi.org/medshift

Topics: causal-inference, mediation-analysis, treatment-effects, stochastic-interventions, machine-learning, targeted-learning, inverse-probability-weights

medshift's Introduction

R/medshift


Causal Mediation Analysis for Stochastic Interventions

Authors: Nima Hejazi and Iván Díaz


What’s medshift?

The medshift R package provides tools for estimating a parameter that arises in the decomposition of the population intervention causal effect into direct and indirect effects under stochastic interventions in the setting of mediation analysis. medshift implements the methodology described in Díaz and Hejazi (2020). Available estimators include the classical substitution (G-computation) estimator, an inverse probability weighted (IPW) estimator, an efficient one-step estimator using cross-fitting (Pfanzagl and Wefelmeyer 1985; Zheng and van der Laan 2011; Chernozhukov et al. 2018), and a cross-validated targeted minimum loss (TML) estimator (van der Laan and Rose 2011; Zheng and van der Laan 2011). medshift integrates with the sl3 R package (Coyle et al. 2022), allowing estimators to leverage machine learning for nuisance parameter estimation.


Installation

Install the most recent version from the master branch on GitHub via remotes:

remotes::install_github("nhejazi/medshift")

Example

To illustrate how medshift may be used to estimate the effect of applying a stochastic intervention to the treatment (A) while keeping the mediator(s) (Z) fixed, consider the following example:

library(data.table)
library(medshift)

# produces a simple data set based on a causal model with mediation
make_simple_mediation_data <- function(n_obs = 1000) {
  # baseline covariate -- simple, binary
  W <- rbinom(n_obs, 1, prob = 0.50)

  # create treatment based on baseline W
  A <- as.numeric(rbinom(n_obs, 1, prob = W / 4 + 0.1))

  # single mediator to affect the outcome
  z1_prob <- 1 - plogis((A^2 + W) / (A + W^3 + 0.5))
  Z <- rbinom(n_obs, 1, prob = z1_prob)

  # create outcome as a linear function of Z, A, W plus white noise
  Y <- Z + A - 0.1 * W + rnorm(n_obs, mean = 0, sd = 0.25)

  # full data structure
  data <- as.data.table(cbind(Y, Z, A, W))
  setnames(data, c("Y", "Z", "A", "W"))
  return(data)
}

# set seed and simulate example data
set.seed(75681)
example_data <- make_simple_mediation_data()

# compute one-step estimate for an incremental propensity score intervention
# that triples (delta = 3) the individual-specific odds of receiving treatment
os_medshift <- medshift(W = example_data$W, A = example_data$A,
                        Z = example_data$Z, Y = example_data$Y,
                        delta = 3, estimator = "onestep",
                        estimator_args = list(cv_folds = 3))
summary(os_medshift)
#>      lwr_ci   param_est      upr_ci   param_var    eif_mean   estimator 
#>      0.7401    0.788136    0.836172    0.000601 1.64686e-17     onestep
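As context for the delta argument, the incremental propensity score intervention replaces the propensity score g(W) with one whose odds of treatment are multiplied by delta. A minimal base-R sketch of this shift (shift_propensity is a hypothetical helper for illustration, not part of medshift's API):

```r
# shifted propensity score under an incremental propensity score intervention:
# the odds g / (1 - g) of receiving treatment are multiplied by delta, giving
# g_delta = delta * g / (delta * g + 1 - g)
shift_propensity <- function(g, delta) {
  delta * g / (delta * g + 1 - g)
}

g <- c(0.1, 0.25, 0.5)
shift_propensity(g, delta = 3)  # 0.25 0.50 0.75 -- each unit's odds tripled
```

Note that delta = 1 leaves the propensity score unchanged, recovering the observed treatment mechanism.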

For details on using data-adaptive regression (machine learning) techniques in the estimation of nuisance parameters, consult the vignette that accompanies this package.


Issues

If you encounter any bugs or have any specific feature requests, please file an issue.


Contributions

Contributions are very welcome. Interested contributors should consult our contribution guidelines prior to submitting a pull request.


Citation

After using the medshift R package, please cite the following:

    @article{diaz2020causal,
      title={Causal mediation analysis for stochastic interventions},
      author={D{\'\i}az, Iv{\'a}n and Hejazi, Nima S},
      year={2020},
      url = {https://doi.org/10.1111/rssb.12362},
      doi = {10.1111/rssb.12362},
      journal={Journal of the Royal Statistical Society: Series B
        (Statistical Methodology)},
      publisher={Wiley Online Library}
    }

    @manual{hejazi2020medshift,
      author = {Hejazi, Nima S and D{\'\i}az, Iv{\'a}n},
      title = {{medshift}: Causal mediation analysis for stochastic
        interventions},
      year  = {2020},
      url = {https://github.com/nhejazi/medshift},
      note = {R package version 0.1.4}
    }

License

© 2018-2022 Nima S. Hejazi

The contents of this repository are distributed under the MIT license. See below for details:

MIT License

Copyright (c) 2018-2022 Nima S. Hejazi

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

References

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1). https://doi.org/10.1111/ectj.12097.

Coyle, Jeremy R, Nima S Hejazi, Ivana Malenica, Rachael V Phillips, and Oleg Sofrygin. 2022. sl3: Modern Pipelines for Machine Learning and Super Learning. https://github.com/tlverse/sl3. https://doi.org/10.5281/zenodo.1342293.

Díaz, Iván, and Nima S Hejazi. 2020. “Causal Mediation Analysis for Stochastic Interventions.” Journal of the Royal Statistical Society: Series B (Statistical Methodology). https://doi.org/10.1111/rssb.12362.

Pfanzagl, J, and W Wefelmeyer. 1985. “Contributions to a General Asymptotic Statistical Theory.” Statistics & Risk Modeling 3 (3-4): 379–88.

van der Laan, Mark J, and Sherri Rose. 2011. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media.

Zheng, Wenjing, and Mark J van der Laan. 2011. “Cross-Validated Targeted Minimum-Loss-Based Estimation.” In Targeted Learning, 459–74. Springer.

medshift's People

Contributors: jeremyrcoyle, nhejazi

medshift's Issues

Nuisance parameter phi should use training data for one-step

Currently, computing the nuisance parameter phi does not make use of the training-validation split necessary for computing a cross-fitted one-step estimator; that is, phi is computed on only a single data set (https://github.com/nhejazi/medshift/blob/master/R/fit_mechanisms.R#L280-L299). In computing the cross-fitted one-step estimator, only the validation data is used for phi (https://github.com/nhejazi/medshift/blob/master/R/estim_helpers.R#L163-L167).

Functionality for true continuous treatment

Hey Nima,

Is there work underway to make this package support continuous treatments? I've tested it in simple continuous A situations and get errors related to eif_component_names for one-step estimation and an estimator_args[["max_iter"]] error for TMLE.

missing outcome support

We should implement a procedure that estimates the full data parameter in the presence of a censoring process, e.g., a data structure like O = (W, A, Z, C, CY), for censoring indicator C. Such an approach would be based on the joint intervention setting C = 1 and the joint intervention on {A, Z} that defines our causal parameters. The estimation procedures would then simply incorporate an extra set of IP weights, specifically to address this intervention, i.e., 1/g(C = 1 | …).
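A rough base-R sketch of the extra weighting step described above (function and variable names are hypothetical, not the package's implementation):

```r
# sketch: augment existing IP weights with censoring weights 1 / g_C, where
# g_C estimates P(C = 1 | ...); only uncensored units (C == 1) contribute.
# all names here are hypothetical, for illustration only
augment_weights_censoring <- function(weights, C, g_C) {
  weights * as.numeric(C == 1) / g_C
}

w   <- c(1, 1, 1, 1)          # existing IP weights
C   <- c(1, 1, 0, 1)          # censoring indicator (1 = observed)
g_C <- c(0.8, 0.5, 0.9, 1.0)  # estimated P(C = 1 | ...)
augment_weights_censoring(w, C, g_C)  # 1.25 2.00 0.00 1.00
```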

utility function for IP weights

The IP weights used in the re-weighted estimator and the efficient estimator should be computed by a shared utility function rather than by manual computation in each.

utility function for Dzw/substitution

The procedure to compute the Dzw component of the EIF and the substitution estimator is nearly identical (with the substitution estimator simply requiring an empirical mean be taken over the Dzw vector for that component of the EIF values). This should be abstracted into a single utility function to be used in both situations.

Support for observation-level IDs

In some applications, it may be useful to support the presence of hierarchical structures in which individual units belong to clusters (e.g., families, hospitals, schools). To support valid estimation and inference in such settings, observation-level IDs must be passed to nuisance regression estimators (such that cross-validation respects these) and to the inferential machinery (averaging EIF estimates at the cluster level).

weight stabilization

Note that the expectation of the weights g / e is equal to one. A good way to stabilize the AIPW estimator is to divide the weights by their empirical sample average. We should implement this.
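A minimal base-R sketch of the proposed stabilization (the helper name is hypothetical):

```r
# sketch: stabilize IP weights by dividing by their empirical mean, so the
# stabilized weights average exactly one in the sample
stabilize_weights <- function(weights) {
  weights / mean(weights)
}

w <- c(0.5, 2, 4, 1.5)
mean(stabilize_weights(w))  # 1
```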

Multiple Mediators

Hi Nima,

First of all, thank you for your amazing work!
I am trying to decompose the effect of multiple mediators, ideally using xgboost. In the documentation, I see that medshift should be able to work with multiple mediators, but I keep getting an estimate for only one mediator, even using the default example.

Best wishes,
Ahmed


library(medshift)
library(data.table)

make_simple_mediation_data <- function(n_obs = 1000) {
  W <- rbinom(n_obs, 1, prob = 0.50)
  A <- as.numeric(rbinom(n_obs, 1, prob = W / 4 + 0.1))
  z1_prob <- 1 - plogis((A^2 + W) / (A + W^3 + 0.5))
  Z <- rbinom(n_obs, 1, prob = z1_prob)
  Y <- Z + A - 0.1 * W + rnorm(n_obs, mean = 0, sd = 0.25)

  data <- as.data.table(cbind(Y, Z, A, W))
  setnames(data, c("Y", "Z", "A", "W"))
  return(data)
}
set.seed(75681)
example_data <- make_simple_mediation_data()
example_data$ZZ <- sample(c(0, 1), nrow(example_data), replace = TRUE)

os_medshift <- medshift(W = example_data$W, A = example_data$A,
                        Z = cbind(example_data$Z, example_data$ZZ),
                        Y = example_data$Y,
                        delta = 3,
                        g_learners = sl3::Lrnr_xgboost$new(),
                        e_learners = sl3::Lrnr_xgboost$new(),
                        m_learners = sl3::Lrnr_xgboost$new(),
                        phi_learners = sl3::Lrnr_xgboost$new(),
                        estimator = "onestep",
                        estimator_args = list(cv_folds = 3))
summary(os_medshift)


clearer documentation and naming

The documentation and organization of some parts of the package leave something to be desired. In particular, several new functions require documenting. Also, the naming of the estimators should be slightly revised for clarity --- e.g., "reweighted" -> "ipw", "efficient" -> "onestep"

TML estimator for binary interventions

TMLE for the binary intervention case should be easy since the exponential tilt defining D_A gives it a nice form (no integration needed, unlike the case for continuous A). We should implement this using the framework exposed in tmle3.

test indexing approaches

A test should be written to ensure that indexing is done correctly for Dy; that is, that A * component and (1 - A) * component suffice for indexing.
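The identity the test would check can be illustrated in base R (the component vectors here are arbitrary illustrative values):

```r
A <- c(1, 0, 1, 0)
comp_a1 <- c(10, 20, 30, 40)  # component evaluated at A = 1 (illustrative)
comp_a0 <- c(1, 2, 3, 4)      # component evaluated at A = 0 (illustrative)

# for binary A, A * component + (1 - A) * component selects the per-arm value
combined <- A * comp_a1 + (1 - A) * comp_a0
combined  # 10 2 30 4, identical to ifelse(A == 1, comp_a1, comp_a0)
```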

allow ensemble learning for phi

Allow option for fitting the nuisance parameter Phi via arbitrary algorithms, as already implemented for other nuisance parameters

Scaling transformation of outcome variable

It's generally the case that a continuous-valued outcome variable is re-scaled to fall in the interval [0, 1] via the transformation Y_scaled = (Y - min(Y)) / (max(Y) - min(Y)) for the purposes of estimation. Upon completion of the estimation procedure, the results should then be back-transformed to the original scale. A pair of functions performing this transformation should be implemented. Note that, once this change is made, it will be necessary to edit the sl3_Task objects for each nuisance parameter regression task to manually specify family = "quasibinomial", indicating that the Y values are not truly binary (i.e., in {0, 1}) but simply fall in the unit interval.
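A minimal base-R sketch of such a pair of functions (names are hypothetical):

```r
# sketch: min-max scale a continuous outcome into [0, 1] for estimation,
# then back-transform results to the original scale
scale_to_unit <- function(y) {
  (y - min(y)) / (max(y) - min(y))
}
unscale_from_unit <- function(y_scaled, y_min, y_max) {
  y_scaled * (y_max - y_min) + y_min
}

y <- c(-2, 0, 3, 8)
y_scaled <- scale_to_unit(y)
range(y_scaled)                              # 0 1
unscale_from_unit(y_scaled, min(y), max(y))  # recovers the original y
```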

Check that CV-folds is greater than 1

There should be a check, implemented with assertthat::assert_that(), to make sure that the number of folds specified for cross-validation is greater than 1. In the current implementation, origami::cross_validate will fail for V = 1 with an ambiguous/confusing error message, due to the way in which make_folds generates the structure of class folds. There's really no good reason that the case V = 1 needs to be supported, since using cross-validation / cross-fitting in constructing the AIPW estimator has theoretical benefits anyway.
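A minimal sketch of the proposed check (shown here with base R's stopifnot rather than assertthat::assert_that, purely for illustration):

```r
# sketch: refuse cv_folds = 1 up front, before folds reach origami
check_cv_folds <- function(cv_folds) {
  stopifnot(
    "cv_folds must be a single number greater than 1" =
      is.numeric(cv_folds) && length(cv_folds) == 1 && cv_folds > 1
  )
  invisible(cv_folds)
}

check_cv_folds(3)  # passes silently
tryCatch(check_cv_folds(1), error = function(e) conditionMessage(e))
```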

TODO

  • TMLE for the binary intervention case (should be easy since the exponential tilt defining D_A gives it a nice form --- no integration needed, unlike the case for continuous A) --- should do this using tmle3
  • Re-organization of package contents to begin accommodating continuous interventions; some machinery can be easily borrowed from txshift

Arbitrary fold structures for one-step estimator

Currently, this line in the function est_onestep, which implements the one-step estimator, restricts the use of origami::make_folds. This should be generalized to allow the number of folds and the specific fold function used to be set arbitrarily by the user. A default might be folds <- origami::make_folds(data, fold_fun = folds_vfold, V = 10).

propensity score truncation

In the case of a binary intervention, both of the nuisance parameters G = P(A | W) and E = P(A | Z, W) are propensity scores, which may be susceptible to instability in the case of (near) practical violations of the assumption of positivity. It would be best to implement a flexible approach to automatically truncate estimated propensity scores, perhaps by default to the range (0.01, 0.99).
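A minimal base-R sketch of such truncation (the helper name is hypothetical; default bounds follow the suggestion above):

```r
# sketch: truncate estimated propensity scores into a bounded range,
# (0.01, 0.99) by default, to guard against practical positivity violations
truncate_pscore <- function(g, bounds = c(0.01, 0.99)) {
  pmin(pmax(g, bounds[1]), bounds[2])
}

g_hat <- c(0.001, 0.2, 0.999)
truncate_pscore(g_hat)  # 0.01 0.20 0.99
```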
