
smirf: Single or multiple imputation of missing data using random forests

Stephen Wade

smirf is an implementation of Stekhoven and Bühlmann's (2012) missForest algorithm: an iterative procedure that imputes missing data of mixed type by fitting random forests to the observed data. It uses the fast random forest implementation supplied by ranger (Wright and Ziegler, 2017).

Example

require(smirf)

# Add some missing values completely at random to iris
set.seed(1)

prop_missing <- 0.2
data_ <- iris
n_prod_m <- prod(dim(data_))
data_[arrayInd(sample.int(n_prod_m, size=n_prod_m * prop_missing),
               .dim=dim(data_))] <- NA

# Impute missing data - here using the original missForest stopping criterion
res <- smirf(data_, stop.measure=measure_stekhoven_2012)

Installation

Installation is easy using devtools:

library(devtools)
install_github('stephematician/smirf')

The ranger and rlang packages are also required; both are available from CRAN:

install.packages(c('ranger', 'rlang'))

Alternatives

For those looking for an alternative to this package:

  1. missRanger - uses the same underlying random forest package (ranger) as smirf.
  2. missForest - the original missForest package.

Details

The original missForest trains its forests and makes predictions via randomForest. The key differences between this implementation and the original are (a minimal sketch of the core loop follows the list):

  • each random forest is fit via ranger (Wright and Ziegler, 2017), which is optimised for training on high dimensional data;
  • the stopping criterion has been updated (see Stopping criterion), or a user may provide their own function;
  • a user may specify the initial guess for the missing values in the first step of the iterative procedure;
    • by default, the original Stekhoven and Bühlmann (2012) missForest approach uses the mean or most frequent value of the complete cases of each variable;
    • an alternative is to sample (with replacement) from the complete cases of each variable (similar to mice); and
  • a user may specify to run the algorithm as many times as desired (or as the machine can handle!) to perform a kind of pseudo-multiple imputation.
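
As a rough illustration of the kind of loop described above, here is a minimal sketch written directly against ranger; the helper names are invented for the example, and this is not smirf's internal code.

library(ranger)

# Minimal sketch (not smirf's code): fill each variable with an initial
# guess, then fit a forest per variable with missing values and overwrite
# the missing entries with its predictions. A real implementation repeats
# this pass until a stopping criterion is met.
initial_guess <- function(x, sample_initial = FALSE) {
    observed <- x[!is.na(x)]
    if (sample_initial)                   # sample from the complete cases
        x[is.na(x)] <- sample(observed, sum(is.na(x)), replace = TRUE)
    else if (is.numeric(x))               # mean of the complete cases
        x[is.na(x)] <- mean(observed)
    else                                  # most frequent value
        x[is.na(x)] <- names(which.max(table(observed)))
    x
}

impute_one_pass <- function(data, sample_initial = FALSE) {
    na_mask <- is.na(data)
    filled <- as.data.frame(lapply(data, initial_guess,
                                   sample_initial = sample_initial))
    for (j in names(filled)[colSums(na_mask) > 0]) {
        # train only on the rows where variable j was actually observed
        fit <- ranger(data = filled[!na_mask[, j], , drop = FALSE],
                      dependent.variable.name = j)
        filled[na_mask[, j], j] <-
            predict(fit, filled[na_mask[, j], , drop = FALSE])$predictions
    }
    filled
}

In the package itself these choices (initial guess, stopping criterion, number of repeats) are exposed as options rather than written by hand.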

In addition, the iterative procedure can optionally be modified:

  • instead of using the 'whole-of-forest' prediction for each missing value, the prediction of a randomly sampled tree in the forest ('tree-sampling') may be used throughout the process (see the sketch after this list);
  • 'Gibbs' sampling may be used as new predictions of missing values become available for each variable;
  • as in Bartlett (2014), a bootstrap sample of the observed data may be used to train each forest; and
  • the forests may be trained on all rows of the data, with missing values filled by their current predictions (this is experimental, inspired by MICE).
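
To picture the 'tree-sampling' option, the following hypothetical sketch (for a numeric variable, not smirf's code) uses ranger's per-tree predictions and draws one tree at random per row:

library(ranger)

# Hypothetical illustration of 'tree-sampling' for a numeric variable:
# rather than the whole-of-forest average, draw one tree at random for each
# row and use that tree's prediction.
fit <- ranger(Sepal.Length ~ ., data = iris, num.trees = 100)
new_rows <- iris[1:5, ]                      # stand-in for rows being imputed

per_tree <- predict(fit, new_rows, predict.all = TRUE)$predictions
# per_tree is a matrix: one row per observation, one column per tree
drawn <- sample.int(ncol(per_tree), nrow(per_tree), replace = TRUE)

tree_sampled <- per_tree[cbind(seq_len(nrow(per_tree)), drawn)]
whole_forest <- rowMeans(per_tree)           # the usual forest prediction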

Combining tree-sampled missing values, Gibbs sampling, and a random initial state is similar to the implementation of Multiple Imputation via Chained Equations (MICE) using random forests proposed by Doove et al. (2014), with one difference: the predictions from each tree are based on the mean of the terminal node rather than on a sample of the training data that belong to the terminal node.
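
This difference can be made concrete with ranger's terminal-node output; the hypothetical sketch below contrasts, for a single tree, the mean of the terminal node with a draw from the training rows that fall in it (in-bag bookkeeping is ignored for brevity, and none of this is smirf's or mice's code):

library(ranger)

# Hypothetical illustration: terminal-node mean (as used here) versus a draw
# from the training rows in the same terminal node (as in Doove et al., 2014).
train  <- iris
target <- iris[1:3, ]                        # stand-in for rows being imputed
fit <- ranger(Sepal.Length ~ ., data = train, num.trees = 100)

# terminal node id of every row, for every tree
train_nodes  <- predict(fit, train,  type = "terminalNodes")$predictions
target_nodes <- predict(fit, target, type = "terminalNodes")$predictions

tree <- 1                                    # one (e.g. randomly sampled) tree
node_members <- function(i)
    train$Sepal.Length[train_nodes[, tree] == target_nodes[i, tree]]
draw_one <- function(x) x[sample.int(length(x), 1L)]

node_mean   <- vapply(seq_len(nrow(target)),
                      function(i) mean(node_members(i)), numeric(1L))
node_sample <- vapply(seq_len(nrow(target)),
                      function(i) draw_one(node_members(i)), numeric(1L))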

Stopping criterion

The original stopping criterion (Stekhoven and Bühlmann, 2012) is not location- and scale-invariant. As an experiment, it has been replaced by a correlation-based calculation by default. The user may specify their own stopping criterion; for this purpose, Stekhoven and Bühlmann's (2012) criterion is included as an example.

By default, at each iteration the rank correlation between the current and the previous data is estimated for ordered/continuous variables, and the proportion of stationary values, that is, values unchanged from the previous iteration, is calculated for categorical (non-ordered) variables. When both the mean correlation and the proportion of stationary values decrease, the imputation procedure is considered to have converged.
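
A rough sketch of the two kinds of measure, assuming they are computed over the imputed entries of consecutive iterations only; the function is invented for illustration and is neither the package's default measure nor measure_stekhoven_2012:

# Hypothetical sketch (not the package's code) of the measures described above,
# computed over the imputed entries of the current and previous iterations.
stopping_measures <- function(current, previous, na_mask) {
    is_cont <- vapply(current, is.numeric, logical(1L))
    # mean rank correlation over ordered/continuous variables
    correlation <- mean(
        vapply(names(current)[is_cont], function(j)
            cor(current[na_mask[, j], j], previous[na_mask[, j], j],
                method = "spearman"),
            numeric(1L)),
        na.rm = TRUE)
    # proportion of imputed categorical values unchanged between iterations
    stationary <- mean(unlist(lapply(
        names(current)[!is_cont],
        function(j) current[na_mask[, j], j] == previous[na_mask[, j], j])))
    c(correlation = correlation, stationary = stationary)
}
# The default criterion declares convergence once both measures decrease;
# Stekhoven and Bühlmann's (2012) criterion instead stops once its normalised
# squared-difference (continuous) and mismatch-proportion (categorical)
# measures increase for the first time.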

Author's note and lament:

I am speculating that this criterion identifies when the unexplained variation of the random forest model of the (complete) data is dominating, and that 'entropy' (possibly incorrect use of this term) has been optimised. It still seems unpleasant to have incomparable measures for the different types of data. I hope that other mathematicians, more savvy than I, can investigate this.

To-do

Not exhaustive:

  • prepare CRAN submission;
  • provide an argument to impute columns in order from most missing to least, similar to missForest;
  • evaluation of error given a 'true' data set, similar to missForest;
  • calculation of a 'mean' OOB error (only variable-wise is currently available), similar to missForest; and
  • implement predictive mean matching as in missRanger.

References

Bartlett, J., 2014. 'Methodology for multiple imputation for missing data in electronic health record data', presented at the 27th International Biometric Conference, Florence, July 6-11.

Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025

Stekhoven, D.J. and Bühlmann, P., 2012. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), pp. 112-118. doi:10.1093/bioinformatics/btr597

Wright, M.N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01

