The goal of this document is to highlight the functionality
implemented in the package missForestPredict
and to provide
guidance for the usage of this package.
The package missForestPredict
implements the missing
data imputation algorithm used in the R package missForest
(Stekhoven and Bühlmann 2012) with
adaptations for prediction settings. The function
missForest
is used to impute a (training) dataset with
missing values and to learn imputations models that can be later used
for imputing new observations. The function
missForestPredict
is used to impute one or multiple new
observations (test set) using the models learned on the training data.
The word “Predict” in the function name should not misguide the user.
The function does not perform prediction of an outcome and is agnostic
on whether the outcome variable for a prediction model is part of the
training data or not; it will treat all columns of the provided data as
variables to be imputed.
Fast implementation
The imputation algorithm is based on random forests (Breiman 2001) as implemented in the
ranger
R package (Wright and Ziegler
2017). Ranger provides a fast implementation of random forests
suitable for large datasets as well as high dimensional data.
Saved models and initialization
The missing data in each column is initialized the mean/mode (or median/mode) of that variable derived on complete cases or a custom imputation scheme. Each variable is then imputed using the iterative algorithm of missForest (Stekhoven and Bühlmann 2012) until a stopping criterion is met. The algorithm supports all variable types (continuous and categorical with two or more levels) and uses a common stopping criterion for all variables. The initialization used for the training data and the random forest models for each iteration are saved and can be later used to impute new observations. Imputation initialization and models are by default “learned” also for variables with no missing values in the original (training) data. This allows for unfortunate situations in which new observations have different missing patterns than the one encountered in the training data (for example, because of accidental registration errors or because of unfortunate train / test split in which all missing values of a variable with low missingness fall in the test set).
Imputation of new observations
The models are applied iteratively to “predict” the missing values of each variable for the new observation, using the same number of iterations as used in the training data.
Convergence criteria
At each iteration the out-of-bag (OOB) error is calculated for each
variable separately. To obtain a global error the OOB errors for all
variables a weighted average is used, that can be controlled by the
var_weights
parameter. By default the weights are set to
the proportion of missing values of each variable.
The normalized mean squared error is used for both continuous and categorical variables. For continuous variables, it is equivalent to \(1 - R^2\). For categorical variables, it is equivalent to \(1 - BSS\) (Brier Skill Score) (Brier et al. 1950).
More information on convergence criteria and error monitoring is provided in a separate vignette.
Support for dataframe and tibble
Both dataframe (data.frame
class) and tibble
(tbl_df
class) are supported as input for the package
functions. Currently matrix input is not supported.
Iris data The iris dataset in R base contains 4 continuous variables and one categorical variable with three categories for N = 150 flowers (Anderson 1935).
Diamonds data The diamonds dataset from
ggplot2
R package contains seven continuous variables and
three categorical variables for N = 53940 diamonds (Wickham 2016).
After installing the package you can load it in your R sessions with:
We will load the iris dataset and split it in a training set (100 observations) and a test set (50 observations).
data(iris)
N <- nrow(iris)
n_test <- floor(N/3)
set.seed(2022)
id_test <- sample(1:N, n_test)
iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]
We produce 10% random missing values on each column in both the training and the test set.
set.seed(2022)
iris_train_miss <- produce_NA(iris_train, proportion = 0.1)
iris_test_miss <- produce_NA(iris_test, proportion = 0.1)
head(iris_train_miss)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 2 4.9 NA 1.4 0.2 <NA>
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 NA <NA>
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 NA 3.1 1.5 0.1 setosa
head(iris_test_miss)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 55 6.5 2.8 4.6 1.5 versicolor
#> 75 6.4 2.9 NA 1.3 versicolor
#> 6 5.4 3.9 1.7 0.4 setosa
#> 123 7.7 2.8 6.7 2.0 virginica
#> 14 4.3 NA 1.1 NA setosa
#> 7 4.6 3.4 1.4 0.3 <NA>
We will impute the training set and learn the random forest
imputation models at the same time using the function
missForest
. To later use the models learned on a training
set for imputation of new observations, save_models
needs
to be set to TRUE
.
By default feedback on the number of iterations and the error
monitoring is provided. You can set verbose = FALSE
to
silence this output. More information on error monitoring is provided in
a separate vignette.
Please note that in all following examples we set the
ranger
parameter num.threads
to 2. If you are
running the code on a machine with more cores and you are willing to use
them for running the code, you can remove this parameter completely.
set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss,
save_models = TRUE,
num.threads = 2)
#> Imputation sequence (missing proportion): Sepal.Length (0.1) Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.1791016514, 0.1136759079, 0.205465453, 0.0486713185, 0.0307028438
#> OOB errors NMSE: 0.2603478624, 0.6149609322, 0.0688856683, 0.0819382466, 0.1383677864
#> diff. convergence measure: 0.7670999008
#> time: 0.13 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.136743901, 0.1013522874, 0.0916661356, 0.0410223856, 0.0293914105
#> OOB errors NMSE: 0.198775288, 0.5482929344, 0.0307325778, 0.0690612551, 0.1324575808
#> diff. convergence measure: 0.037036172
#> time: 0.14 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.1411053193, 0.1021741354, 0.086530287, 0.043991304, 0.0292682927
#> OOB errors NMSE: 0.2051151843, 0.552738946, 0.0290107001, 0.0740594344, 0.1319027285
#> diff. convergence measure: -0.0027014715
#> time: 0.12 seconds
The imputed training set can be found by extracting ximp
dataframe from the object.
iris_train_imp <- iris_train_imp_object$ximp
head(iris_train_imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 2 4.900000 3.313362 1.4 0.200000 setosa
#> 4 4.600000 3.100000 1.5 0.200000 setosa
#> 5 5.000000 3.600000 1.4 0.256828 setosa
#> 8 5.000000 3.400000 1.5 0.200000 setosa
#> 9 4.400000 2.900000 1.4 0.200000 setosa
#> 10 4.774814 3.100000 1.5 0.100000 setosa
We will further impute the test set using the learned imputation
models. The function missForestPredict
will:
initialize the missing values in each variable with the initialization “learned” from the training set (mean/mode)
imperatively predict the missing values of each variable using the learned random forest models for each iteration
iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object,
newdata = iris_test_miss)
head(iris_test_imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 55 6.5 2.80000 4.60000 1.5000000 versicolor
#> 75 6.4 2.90000 4.42546 1.3000000 versicolor
#> 6 5.4 3.90000 1.70000 0.4000000 setosa
#> 123 7.7 2.80000 6.70000 2.0000000 virginica
#> 14 4.3 3.13961 1.10000 0.2106567 setosa
#> 7 4.6 3.40000 1.40000 0.3000000 setosa
missForestPredict
can impute a new observation with
missing values. The new observation has to be provided as a dataframe
with one row and with named columns (the column names have to correspond
to the column names of the training set).
single_observation <- iris_test_miss[1,]
single_observation[1,2] <- NA
print(single_observation)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 55 6.5 NA 4.6 1.5 versicolor
single_observation_imp <- missForestPredict::missForestPredict(iris_train_imp_object,
newdata = single_observation)
print(single_observation_imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 55 6.5 3.00739 4.6 1.5 versicolor
missForestPredict
package can impute observations with
new missingness patterns not present in the training data as well as
variables with no missingness (complete) in training data. The
initialization and the random forest imputation models are “learned” for
all variables in the dataset, regardless of the amount of
missingness.
The function missForest
returns both the imputed
training dataframe as well as the imputation models.
str(iris_train_imp_object, max.level = 1)
#> List of 11
#> $ ximp :'data.frame': 100 obs. of 5 variables:
#> $ init :List of 5
#> $ initialization : chr "mean/mode"
#> $ impute_sequence : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
#> $ maxiter : num 10
#> $ save_models : logi TRUE
#> $ models :List of 3
#> $ return_integer_as_integer: logi FALSE
#> $ integer_columns : chr(0)
#> $ predictor_matrix : num [1:5, 1:5] 0 1 1 1 1 1 0 1 1 1 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ OOB_err :'data.frame': 55 obs. of 7 variables:
#> - attr(*, "class")= chr "missForest"
The imputed training data is though not necessary for imputing the
test set. To avoid storing further these data in the object,
ximp
can be set to NULL.
iris_train_imp <- iris_train_imp_object$ximp
iris_train_imp_object$ximp <- NULL
iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object,
newdata = iris_test_miss)
head(iris_test_imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 55 6.5 2.80000 4.60000 1.5000000 versicolor
#> 75 6.4 2.90000 4.42546 1.3000000 versicolor
#> 6 5.4 3.90000 1.70000 0.4000000 setosa
#> 123 7.7 2.80000 6.70000 2.0000000 virginica
#> 14 4.3 3.13961 1.10000 0.2106567 setosa
#> 7 4.6 3.40000 1.40000 0.3000000 setosa
In ximp
is set to NULL by mistake, and the training set
imputation is lost, it can be recovered without rerunning the algorithm.
Imputing the training set with the function
missForestPredict
will give the same results as the initial
results of missForest
function because the same models are
applied to the variables in the same order and for the same number of
iterations.
set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss,
save_models = TRUE,
num.threads = 2)
#> Imputation sequence (missing proportion): Sepal.Length (0.1) Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.1791016514, 0.1136759079, 0.205465453, 0.0486713185, 0.0307028438
#> OOB errors NMSE: 0.2603478624, 0.6149609322, 0.0688856683, 0.0819382466, 0.1383677864
#> diff. convergence measure: 0.7670999008
#> time: 0.14 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.136743901, 0.1013522874, 0.0916661356, 0.0410223856, 0.0293914105
#> OOB errors NMSE: 0.198775288, 0.5482929344, 0.0307325778, 0.0690612551, 0.1324575808
#> diff. convergence measure: 0.037036172
#> time: 0.26 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.1411053193, 0.1021741354, 0.086530287, 0.043991304, 0.0292682927
#> OOB errors NMSE: 0.2051151843, 0.552738946, 0.0290107001, 0.0740594344, 0.1319027285
#> diff. convergence measure: -0.0027014715
#> time: 0.13 seconds
# store imputed dataframe
iris_train_imp <- iris_train_imp_object$ximp
iris_train_imp_object$ximp <- NULL
# re-impute the same dataframe using missForestPredict
iris_train_imp_2 <- missForestPredict::missForestPredict(iris_train_imp_object,
newdata = iris_train_miss)
identical(iris_train_imp, iris_train_imp_2)
#> [1] TRUE
num.trees
Although missForestPredict
benefits of the improved
computation time of ranger
package, larger dataset can
still prove time consuming to impute.
We will load the diamonds dataset, which contains more than 50000 observations and produce 30% missing values on each variable.
library(ggplot2)
data(diamonds)
# split train / test
N <- nrow(diamonds)
n_test <- floor(N/3)
set.seed(2022)
id_test <- sample(1:N, n_test)
diamonds_train <- diamonds[-id_test,]
diamonds_test <- diamonds[id_test,]
diamonds_train_miss <- produce_NA(diamonds_train, proportion = 0.1)
diamonds_test_miss <- produce_NA(diamonds_test, proportion = 0.1)
head(diamonds_train_miss)
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.23 Ideal <NA> SI2 61.5 55 326 3.95 3.98 2.43
#> 2 0.21 Premium E <NA> 59.8 61 326 3.89 3.84 2.31
#> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 0.29 Premium I <NA> 62.4 58 334 4.2 4.23 2.63
#> 5 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 6 0.24 Very Good I VVS1 NA 57 336 3.95 3.98 2.47
head(diamonds_test_miss)
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.33 Ideal <NA> <NA> 61.9 57 1312 4.43 4.46 2.75
#> 2 0.3 Ideal H SI1 62.6 53.1 475 NA 4.3 2.69
#> 3 0.4 Ideal I VVS2 62 56 1240 4.74 4.77 2.95
#> 4 0.73 Ideal F SI1 61.7 55 3249 5.79 5.82 3.58
#> 5 0.3 Premium E SI2 NA 58 540 4.31 4.28 2.66
#> 6 1.01 Fair F VS2 64.8 NA 4791 6.3 6.25 4.07
The function missForest
supports additional parameters
to be passed to the ranger
function. By default, the
default values of ranger
are used (e.g. the number of trees
in the forest is 500). This can be overridden by passing
num.trees = 100
. Using less trees will prove to be
computationally more efficient.
set.seed(2022)
diamonds_train_imp_object <- missForestPredict::missForest(diamonds_train_miss,
save_models = TRUE,
num.trees = 100,
num.threads = 2)
#> Imputation sequence (missing proportion): carat (0.1) cut (0.1) color (0.1) clarity (0.1) depth (0.1) table (0.1) price (0.1) x (0.1) y (0.1) z (0.1)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.0007423176, 0.0755630292, 0.0981515212, 0.0683257445, 0.3921561624, 2.3368483435, 351214.440378606, 0.0058204377, 0.0243012546, 0.0073783223
#> OOB errors NMSE: 0.0033187873, 0.5276536603, 0.8186435321, 0.6643940367, 0.1907520008, 0.4713497545, 0.0219643927, 0.0046172615, 0.019334449, 0.0153009953
#> diff. convergence measure: 0.726267113
#> time: 30.11 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.0003850486, 0.0638179803, 0.0886416713, 0.0602309638, 0.1824754476, 2.2729623355, 316767.391799821, 0.0043671144, 0.0239301468, 0.007432586
#> OOB errors NMSE: 0.001721493, 0.4456384457, 0.7393255851, 0.5856810408, 0.0887594282, 0.4584637432, 0.019810129, 0.0034643631, 0.0190391899, 0.0154135261
#> diff. convergence measure: 0.0360011926
#> time: 34.66 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.0003799471, 0.0630639581, 0.0877946221, 0.0598951923, 0.1867112833, 2.2598167222, 320242.830123744, 0.0056301098, 0.0252573289, 0.0073136525
#> OOB errors NMSE: 0.0016986846, 0.4403731386, 0.7322606782, 0.5824160259, 0.0908198169, 0.4558122311, 0.0200274774, 0.0044662774, 0.0200951162, 0.0151668847
#> diff. convergence measure: 0.0014180613
#> time: 34.78 seconds
#>
#> missForest iteration 4 in progress...done!
#> OOB errors MSE: 0.0003519875, 0.0628745507, 0.0878521041, 0.0601606389, 0.1861817272, 2.2590258107, 320065.558576817, 0.0045284449, 0.0243025288, 0.0075103301
#> OOB errors NMSE: 0.0015736818, 0.4390505143, 0.7327401137, 0.5849972073, 0.090562231, 0.4556527018, 0.0200163911, 0.003592344, 0.0193354627, 0.0155747501
#> diff. convergence measure: 4.0933e-06
#> time: 34.17 seconds
#>
#> missForest iteration 5 in progress...done!
#> OOB errors MSE: 0.0003756882, 0.0630494168, 0.0879549408, 0.0599350225, 0.189900218, 2.2572089337, 319680.371423409, 0.0046050811, 0.0244964369, 0.0076207539
#> OOB errors NMSE: 0.0016796439, 0.4402715969, 0.7335978339, 0.5828033312, 0.0923709736, 0.4552862319, 0.0199923021, 0.0036531383, 0.019489739, 0.0158037443
#> diff. convergence measure: -0.0001853137
#> time: 35.34 seconds
# impute test set
diamonds_train_imp_object$ximp <- NULL
diamonds_test_imp <- missForestPredict::missForestPredict(diamonds_train_imp_object,
newdata = diamonds_test_miss)
head(diamonds_test_imp)
#> # A tibble: 6 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.33 Ideal E IF 61.9 57 1312 4.43 4.46 2.75
#> 2 0.3 Ideal H SI1 62.6 53.1 475 4.28 4.3 2.69
#> 3 0.4 Ideal I VVS2 62 56 1240 4.74 4.77 2.95
#> 4 0.73 Ideal F SI1 61.7 55 3249 5.79 5.82 3.58
#> 5 0.3 Premium E SI2 62.1 58 540 4.31 4.28 2.66
#> 6 1.01 Fair F VS2 64.8 56.8 4791 6.3 6.25 4.07
Alternatively, the maxiter
parameter can be set to a
lower number (the default is 10) or other ranger
parameters
can be adapted. Note that not all parameters to ranger function are
supported. Parameters that should be adapted in function of the imputed
variable (outcome for the ranger function) are not supported. Generic
parameters that can be applied to all variables are supported (like:
num.trees, mtry, min.node.size, max.depth, replace,
…), as
well as class.weights
for factor variables.
The parameter initialization
supports three values:
mean/mode
, median/mode
and
custom
. The default is mean/mode, which will initialize
each variable with the mean (for continuous variables) or mode (for
categorical variables) calculated based on the complete observations on
that variable.
When custom
is used, a complete dataframe is expected in
x_init
. For example, the imputations of another imputation
method can be used as initialization. When custom
is used
in missForest
function, an initialization dataframe has to
be passed to the missForestPredict
function later on
too.
We will exemplify this with an initialization using linear models on
the iris
dataset. Let’s assume that variables
Sepal.Width
and Species
are known to be never
missing. We will regress the other variables on these two variables and
save the linear models to be applied on the test set afterwards.
data(iris)
# split train / test
N <- nrow(iris)
n_test <- floor(N/3)
set.seed(2022)
id_test <- sample(1:N, n_test)
iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]
# produce missing values
set.seed(2022)
iris_train_miss <- produce_NA(iris_train, proportion = c(0.2, 0, 0.2, 0.2, 0))
iris_test_miss <- produce_NA(iris_test, proportion = c(0.2, 0, 0.2, 0.2, 0))
# build linear models for Sepal.Length, Petal.Width, Petal.Length using complete cases for each variable
fit_1 <- lm(Sepal.Length ~ ., data = iris_train_miss[!is.na(iris_train_miss$Sepal.Length),
c("Sepal.Length", "Sepal.Width", "Species")])
fit_2 <- lm(Petal.Width ~ ., data = iris_train_miss[!is.na(iris_train_miss$Petal.Width),
c("Petal.Width", "Sepal.Width", "Species")])
fit_3 <- lm(Petal.Length ~ ., data = iris_train_miss[!is.na(iris_train_miss$Petal.Length),
c("Petal.Length", "Sepal.Width", "Species")])
# impute training with predictions of linear model
iris_train_init <- iris_train_miss
iris_train_init$Sepal.Length[is.na(iris_train_init$Sepal.Length)] <-
predict(fit_1, iris_train_init[is.na(iris_train_init$Sepal.Length), c("Sepal.Width", "Species")])
iris_train_init$Petal.Width[is.na(iris_train_init$Petal.Width)] <-
predict(fit_2, iris_train_init[is.na(iris_train_init$Petal.Width), c("Sepal.Width", "Species")])
iris_train_init$Petal.Length[is.na(iris_train_init$Petal.Length)] <-
predict(fit_3, iris_train_init[is.na(iris_train_init$Petal.Length), c("Sepal.Width", "Species")])
# impute the training set using this initialization
set.seed(2022)
iris_train_imp_obj <- missForest(iris_train_miss,
save_models = TRUE,
initialization = "custom",
x_init = iris_train_init,
num.threads = 2)
#> Imputation sequence (missing proportion): Sepal.Width (0) Species (0) Sepal.Length (0.2) Petal.Length (0.2) Petal.Width (0.2)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.0729457458, 0.023224213, 0.164375904, 0.1299985753, 0.0364212886
#> OOB errors NMSE: 0.4068272086, 0.104550779, 0.2563766732, 0.0430637102, 0.0632384372
#> diff. convergence measure: 0.8791070598
#> time: 0.11 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.0889019494, 0.0286355472, 0.1397322375, 0.0891585948, 0.0356435945
#> OOB errors NMSE: 0.4958168777, 0.1289115272, 0.2179400101, 0.0295349382, 0.0618881236
#> diff. convergence measure: 0.0177719163
#> time: 0.13 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.0951034048, 0.0278894294, 0.1464458736, 0.0923955305, 0.0354256936
#> OOB errors NMSE: 0.5304031411, 0.1255526535, 0.228411251, 0.0306072151, 0.0615097813
#> diff. convergence measure: -0.0037217251
#> time: 0.11 seconds
# build test set initialization using the linear models learned on training
iris_test_init <- iris_test_miss
iris_test_init$Sepal.Length[is.na(iris_test_init$Sepal.Length)] <-
predict(fit_1, iris_test_init[is.na(iris_test_init$Sepal.Length), c("Sepal.Width", "Species")])
iris_test_init$Petal.Width[is.na(iris_test_init$Petal.Width)] <-
predict(fit_2, iris_test_init[is.na(iris_test_init$Petal.Width), c("Sepal.Width", "Species")])
iris_test_init$Petal.Length[is.na(iris_test_init$Petal.Length)] <-
predict(fit_3, iris_test_init[is.na(iris_test_init$Petal.Length), c("Sepal.Width", "Species")])
# impute test set
iris_test_imp <- missForestPredict(iris_train_imp_obj, newdata = iris_test_miss,
x_init = iris_test_init)
evaluate_imputation_error(iris_test_imp, iris_test_miss, iris_test)
#> variable MSE NMSE MER macro_F1 F1_score
#> 1 Petal.Length 0.03452000 0.01345075 NA NA NA
#> 2 Petal.Width 0.04080528 0.12767611 NA NA NA
#> 3 Sepal.Length 0.08041247 0.16808626 NA NA NA
#> 4 Sepal.Width 0.00000000 0.00000000 NA NA NA
#> 5 Species NA NA 0 0 0
evaluate_imputation_error(iris_test_init, iris_test_miss, iris_test)
#> variable MSE NMSE MER macro_F1 F1_score
#> 1 Petal.Length 0.07731568 0.03012612 NA NA NA
#> 2 Petal.Width 0.03350469 0.10483320 NA NA NA
#> 3 Sepal.Length 0.10322009 0.21576106 NA NA NA
#> 4 Sepal.Width 0.00000000 0.00000000 NA NA NA
#> 5 Species NA NA 0 0 0
The errors (MSE, NMSE) for Sepal.Length
and
Petal.Length
decreased compared to the initialization
imputation. The missclasification error (MER) for Species
is 0 (if a variable does not have missing values, the error returned
will be 0).
Also keep in mind that x_init
should be complete
dataframe, if used. For example, if you would wish to impute only one
variable using a different method as initialization, you cannot rely on
the missForest
function to do mean/mode initialization for
the other variables. You would have to do it yourself before providing
x_init
.
predictor_matrix
By default, missForest will create models for all variables in a
dataframe and will use all other variables in the imputation of each
variable. Controlling which variables to impute and which predictors to
use for the imputation of each variable can be done using the parameter
predictor_matrix
. We illustrate three different cases:
variables not imputed and not used as predictor
variables not imputed but used as predictor
variables imputed but not used as predictor
We consider the case when a date variable is kept in the dataframe for further reference. Such variable does not need imputation, nor can be used as predictor.
data(iris)
iris$Date_collected <- seq(Sys.Date() - nrow(iris) + 1, Sys.Date(), by="days")
# split train / test
N <- nrow(iris)
n_test <- floor(N/3)
set.seed(2022)
id_test <- sample(1:N, n_test)
iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]
iris_train_miss <- produce_NA(iris_train, proportion = c(0.1,0.1,0.1,0.1,0.1,0))
iris_test_miss <- produce_NA(iris_test, proportion = c(0.1,0.1,0.1,0.1,0.1,0))
head(iris_train_miss)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Date_collected
#> 2 4.9 3.0 1.4 0.2 setosa 2023-07-16
#> 4 4.6 3.1 1.5 0.2 setosa 2023-07-18
#> 5 5.0 3.6 1.4 0.2 setosa 2023-07-19
#> 8 5.0 3.4 1.5 0.2 setosa 2023-07-22
#> 9 NA 2.9 1.4 0.2 setosa 2023-07-23
#> 10 4.9 NA 1.5 0.1 setosa 2023-07-24
We will create a predictor matrix so that the
Date_collected
variable is not included in imputation and
not included as predictor. We will initialize the predictor matrix using
the function create_predictor_matrix
. This is the default
predictor matrix in which all variables are included for imputation and
all other variables are used as predictors. The diagonal is 0 as one
variable will not be used as predictor for its own imputation model.
predictor_matrix <- create_predictor_matrix(iris_train_miss)
print(predictor_matrix)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length 0 1 1 1 1
#> Sepal.Width 1 0 1 1 1
#> Petal.Length 1 1 0 1 1
#> Petal.Width 1 1 1 0 1
#> Species 1 1 1 1 0
#> Date_collected 1 1 1 1 1
#> Date_collected
#> Sepal.Length 1
#> Sepal.Width 1
#> Petal.Length 1
#> Petal.Width 1
#> Species 1
#> Date_collected 0
The rows control the variables to be imputed and the columns control
the variables as predictors. We will set all zeros for
Date_collected
both row-wise and column-wise.
predictor_matrix["Date_collected",] <- 0
predictor_matrix[,"Date_collected"] <- 0
print(predictor_matrix)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length 0 1 1 1 1
#> Sepal.Width 1 0 1 1 1
#> Petal.Length 1 1 0 1 1
#> Petal.Width 1 1 1 0 1
#> Species 1 1 1 1 0
#> Date_collected 0 0 0 0 0
#> Date_collected
#> Sepal.Length 0
#> Sepal.Width 0
#> Petal.Length 0
#> Petal.Width 0
#> Species 0
#> Date_collected 0
We will pass this matrix to the missForest function.
set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE,
predictor_matrix = predictor_matrix,
verbose = TRUE,
num.threads = 2)
#> Variables skipped from imputation (missing proportion): Date_collected (0)
#> Imputation sequence (missing proportion): Sepal.Length (0.1) Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.1341044569, 0.1040272501, 0.1722004553, 0.0452709963, 0.0283471652
#> OOB errors NMSE: 0.1937151536, 0.5888499508, 0.0550696408, 0.0747929897, 0.1276095062
#> diff. convergence measure: 0.7919925518
#> time: 0.12 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.1319444727, 0.0949470098, 0.0762417102, 0.0421574795, 0.0298083803
#> OOB errors NMSE: 0.1905950361, 0.5374509276, 0.0243820702, 0.0696490951, 0.1341874102
#> diff. convergence measure: 0.0167545404
#> time: 0.13 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.1283661665, 0.0956347547, 0.0736066676, 0.0421326132, 0.0303473639
#> OOB errors NMSE: 0.1854261389, 0.541343932, 0.0235393846, 0.0696080131, 0.1366137353
#> diff. convergence measure: -5.3333e-05
#> time: 0.14 seconds
iris_train_imp <- iris_train_imp_object$ximp
head(iris_train_imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Date_collected
#> 2 4.900000 3.000000 1.4 0.2 setosa 2023-07-16
#> 4 4.600000 3.100000 1.5 0.2 setosa 2023-07-18
#> 5 5.000000 3.600000 1.4 0.2 setosa 2023-07-19
#> 8 5.000000 3.400000 1.5 0.2 setosa 2023-07-22
#> 9 4.687727 2.900000 1.4 0.2 setosa 2023-07-23
#> 10 4.900000 3.107115 1.5 0.1 setosa 2023-07-24
iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object,
newdata = iris_test_miss)
head(iris_test_imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Date_collected
#> 55 6.500000 2.800000 4.695463 1.5 versicolor 2023-09-07
#> 75 5.887097 2.900000 4.300000 1.3 versicolor 2023-09-27
#> 6 5.400000 3.606048 1.700000 0.4 setosa 2023-07-20
#> 123 7.700000 3.375697 6.700000 2.0 virginica 2023-11-14
#> 14 4.300000 3.000000 1.358195 0.1 setosa 2023-07-28
#> 7 4.600000 3.400000 1.400000 0.3 setosa 2023-07-21
We consider the case when we do not want to impute the variable
Sepal.Length
(for example, we do not plan to use it further
in a prediction model), but we want to use it as predictor for the
imputation of other variables.
data(iris)
# split train / test
N <- nrow(iris)
n_test <- floor(N/3)
set.seed(2022)
id_test <- sample(1:N, n_test)
iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]
iris_train_miss <- produce_NA(iris_train, proportion = 0.1)
iris_test_miss <- produce_NA(iris_test, proportion = 0.1)
head(iris_train_miss)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 2 4.9 3.0 1.4 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 NA 2.9 1.4 0.2 setosa
#> 10 4.9 NA 1.5 0.1 setosa
We create the predictor matrix and set zeros row-wise. The row
represents the variable to be imputed, the columns represent the
predictors used in imputation. Setting zero row-wise means: do not use
any of the predictors in imputing Sepal.Length
. This
situation is interpreted as: do not build imputation models for
Sepal.Length
.
predictor_matrix <- create_predictor_matrix(iris_train_miss)
predictor_matrix["Sepal.Length",] <- 0
print(predictor_matrix)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length 0 0 0 0 0
#> Sepal.Width 1 0 1 1 1
#> Petal.Length 1 1 0 1 1
#> Petal.Width 1 1 1 0 1
#> Species 1 1 1 1 0
set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE,
predictor_matrix = predictor_matrix,
verbose = TRUE,
num.threads = 2)
#> Variables skipped from imputation (missing proportion): Sepal.Length (0.1)
#> Imputation sequence (missing proportion): Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.1117129172, 0.1827309477, 0.0486749348, 0.0294848518
#> OOB errors NMSE: 0.6323549429, 0.0584372883, 0.080416695, 0.1327309928
#> diff. convergence measure: 0.7740150202
#> time: 0.09 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.106227064, 0.0768522608, 0.039277068, 0.0310534506
#> OOB errors NMSE: 0.6013020759, 0.0245773241, 0.064890318, 0.1397923028
#> diff. convergence measure: 0.0183444746
#> time: 0.11 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.1033715293, 0.0743138913, 0.0413143614, 0.0305141296
#> OOB errors NMSE: 0.5851382198, 0.0237655545, 0.0682561654, 0.1373644589
#> diff. convergence measure: 0.0040094055
#> time: 0.1 seconds
#>
#> missForest iteration 4 in progress...done!
#> OOB errors MSE: 0.1059414378, 0.0725507058, 0.0407121795, 0.0289191544
#> OOB errors NMSE: 0.5996852785, 0.0232016884, 0.0672612903, 0.1301844111
#> diff. convergence measure: -0.0014520674
#> time: 0.11 seconds
iris_train_imp <- iris_train_imp_object$ximp
iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object,
newdata = iris_test_miss)
head(iris_test_imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 55 6.500000 2.800000 4.689202 1.5 versicolor
#> 75 5.848889 2.900000 4.300000 1.3 versicolor
#> 6 5.400000 3.599842 1.700000 0.4 setosa
#> 123 7.700000 3.364393 6.700000 2.0 virginica
#> 14 4.300000 3.000000 1.345240 0.1 setosa
#> 7 4.600000 3.400000 1.400000 0.3 setosa
Models are not built for the variable Sepal.Length
.
Nevertheless this variable is initialized (with mean imputation), as it
is used in other imputation models. As this variable is not imputed
iteratively, its mean imputation might not be optimal for it being used
as predictor. We recommend using this case only for complete variables
in the training data and no expected missing values in future data.
A third case is that when we want to impute a variable (based on all
other variables), but we don’t want to use it as predictor for specific
variables. For example we do not want to use the Species
variable in the imputation of Petal.Length
and
Petal.Width
.
data(iris)
# split train / test
N <- nrow(iris)
n_test <- floor(N/3)
set.seed(2022)
id_test <- sample(1:N, n_test)
iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]
iris_train_miss <- produce_NA(iris_train, proportion = 0.1)
iris_test_miss <- produce_NA(iris_test, proportion = 0.1)
head(iris_train_miss)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 2 4.9 3.0 1.4 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 NA 2.9 1.4 0.2 setosa
#> 10 4.9 NA 1.5 0.1 setosa
predictor_matrix <- create_predictor_matrix(iris_train_miss)
predictor_matrix["Petal.Length","Species"] <- 0
predictor_matrix["Petal.Width","Species"] <- 0
print(predictor_matrix)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length 0 1 1 1 1
#> Sepal.Width 1 0 1 1 1
#> Petal.Length 1 1 0 1 0
#> Petal.Width 1 1 1 0 0
#> Species 1 1 1 1 0
set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE,
predictor_matrix = predictor_matrix,
verbose = TRUE,
num.threads = 2)
#> Imputation sequence (missing proportion): Sepal.Length (0.1) Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.1341044569, 0.1040272501, 0.2179548815, 0.0606276203, 0.0315971841
#> OOB errors NMSE: 0.1937151536, 0.5888499508, 0.0697018891, 0.1001639317, 0.14224001
#> diff. convergence measure: 0.781065813
#> time: 0.14 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.1303642673, 0.0916584987, 0.1635653256, 0.0630372602, 0.033080005
#> OOB errors NMSE: 0.1883124145, 0.5188361933, 0.0523081296, 0.1041449391, 0.1489151765
#> diff. convergence measure: 0.0164308165
#> time: 0.12 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.1332057859, 0.0976123121, 0.1561038932, 0.0596154603, 0.0324109161
#> OOB errors NMSE: 0.1924170148, 0.552537966, 0.0499219663, 0.0984917248, 0.1459031607
#> diff. convergence measure: -0.0053509959
#> time: 0.16 seconds
iris_train_imp <- iris_train_imp_object$ximp
iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object,
newdata = iris_test_miss)
head(iris_test_imp)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 55 6.50000 2.800000 4.824410 1.5 versicolor
#> 75 5.89133 2.900000 4.300000 1.3 versicolor
#> 6 5.40000 3.596408 1.700000 0.4 setosa
#> 123 7.70000 3.374780 6.700000 2.0 virginica
#> 14 4.30000 3.000000 1.380106 0.1 setosa
#> 7 4.60000 3.400000 1.400000 0.3 setosa
A predictor matrix can be checked using the function
check_predictor_matrix
. In case the predictor matrix is not
valid, an error will be returned. For valid predictor matrices, human
readable message about the imputation scheme will be printed.
check_predictor_matrix(predictor_matrix, iris_train)
#> GREAT! All checks passed succesfully.
#> Variable Sepal.Length will be imputed using all other variables.
#> Variable Sepal.Width will be imputed using all other variables.
#> Variable Petal.Length will be imputed using Sepal.Length, Sepal.Width, Petal.Width.
#> Variable Petal.Width will be imputed using Sepal.Length, Sepal.Width, Petal.Length.
#> Variable Species will be imputed using all other variables.
proportion_usable_cases
on the predictor
matrixThe default predictor matrix can be altered by the setting of
proportion_usable_cases
.
missForest initializes each variable and then builds models for each
variable using the observed values of that variable as outcome of a
random forest model. It then imputes the missing part of the variable
using the learned models. Situations when a variable is mostly or
completely missing for the observed part of another variable are
suboptimal for learning the models, as the models will mostly rely on
the initialization rather than on true values. Situations when a
variable is missing for the missing part of another variable are
suboptimal for the imputation (prediction) part as the predictions will
rely on the initialization rather than on true values. By default
proportion_usable_cases
will filter the variables (as
predictors) if they are completely missing for the observed or missing
part of another variable.
As an example, we will make Sepal.Length
and
Sepal.Width
in iris
missing together and
Petal.Length
missing for the observed part of these
variables. Checking the predictor matrix post imputation will reveal
that these variables will not be used as predictor for each other
because of the “sub-optimal” missingness pattern.
data(iris)
iris_mis <- iris
iris_mis[1:50, "Sepal.Length"] <- NA
iris_mis[1:50, "Sepal.Width"] <- NA
iris_mis[51:150, "Petal.Length"] <- NA
set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_mis, save_models = TRUE,
verbose = TRUE,
num.threads = 2)
#> Imputation sequence (missing proportion): Petal.Width (0) Species (0) Sepal.Length (0.333333333333333) Sepal.Width (0.333333333333333) Petal.Length (0.666666666666667)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.030348177, 0.0289622723, 0.3109868031, 0.078778252, 0.0294038219
#> OOB errors NMSE: 0.0525843832, 0.1303302253, 0.7149845113, 0.7186747556, 0.9948511947
#> diff. convergence measure: 0.144159586
#> time: 0.09 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.0305443675, 0.0291665594, 0.315261906, 0.0790942654, 0.0295874733
#> OOB errors NMSE: 0.0529243232, 0.1312495175, 0.7248133283, 0.7215576688, 1.0010648692
#> diff. convergence measure: -0.0062847698
#> time: 0.11 seconds
print(iris_train_imp_object$predictor_matrix)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length 0 0 0 1 1
#> Sepal.Width 0 0 0 1 1
#> Petal.Length 0 0 0 1 1
#> Petal.Width 1 1 1 0 1
#> Species 1 1 1 1 0
check_predictor_matrix(iris_train_imp_object$predictor_matrix, iris_mis, verbose = TRUE)
#> GREAT! All checks passed succesfully.
#> Variable Sepal.Length will be imputed using Petal.Width, Species.
#> Variable Sepal.Width will be imputed using Petal.Width, Species.
#> Variable Petal.Length will be imputed using Petal.Width, Species.
#> Variable Petal.Width will be imputed using all other variables.
#> Variable Species will be imputed using all other variables.
The package missForestPredict
is based on the
missForest
package (Stekhoven and
Bühlmann 2012) with additional functionality for convergence
criteria, initialization, predictor specification and imputation of new
observations.
Other imputation packages based on random forests are available in
the R ecosystem, using iterative algorithms, like
missRanger
(Mayer and Mayer
2019), mice
(Van Buuren,
Oudshoorn, and Jong 2007), CALIBERrfimpute
(Shah et al. 2021), miceRanger
(Wilson 2020). These provide various
extended functionality, but they do not support the imputation of new
observations, with the exception of miceRanger
(Wilson 2020) which, to our knowledge, is the
only iterative algorithm that can impute new observations. Non-iterative
algorithms are also available: the function rfImpute
in the
randomForest
(Liaw and Wiener
2002) package requires presence of the outcome at imputation
time, which makes it inadequate for prediction settings (where
imputation for a single new observation for which the outcome is not yet
know may be of interest); caret
(Kuhn et al. 2008) implements bagged trees
without iterations and will impute new observations for most missingness
patterns; imputeMissings
implements random forests
non-iteratively but the initialization is not learned on the training
set and will require a test set for initialization.
The R package missForestPredict
uses an iterative
algorithm and can impute a single new observation in (applied)
prediction settings, even when the new observation exhibits a
missingness pattern not encountered in the training set. It is therefore
suitable for (applied) prediction settings.