How to use the package

Data used for demonstration

Iris data The iris dataset in R base contains 4 continuous variables and one categorical variable with three categories for N = 150 flowers (Anderson 1935).

Diamonds data The diamonds dataset from ggplot2 R package contains seven continuous variables and three categorical variables for N = 53940 diamonds (Wickham 2016).

Imputation of training and test set

After installing the package you can load it in your R sessions with:

library(missForestPredict)

We will load the iris dataset and split it in a training set (100 observations) and a test set (50 observations).

data(iris)

N <- nrow(iris)
n_test <- floor(N/3)

set.seed(2022)
id_test <- sample(1:N, n_test)

iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]

We produce 10% random missing values on each column in both the training and the test set.

set.seed(2022)
iris_train_miss <- produce_NA(iris_train, proportion = 0.1)
iris_test_miss <- produce_NA(iris_test, proportion = 0.1)

head(iris_train_miss)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 2           4.9          NA          1.4         0.2    <NA>
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4          NA    <NA>
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10           NA         3.1          1.5         0.1  setosa
head(iris_test_miss)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 55           6.5         2.8          4.6         1.5 versicolor
#> 75           6.4         2.9           NA         1.3 versicolor
#> 6            5.4         3.9          1.7         0.4     setosa
#> 123          7.7         2.8          6.7         2.0  virginica
#> 14           4.3          NA          1.1          NA     setosa
#> 7            4.6         3.4          1.4         0.3       <NA>

We will impute the training set and learn the random forest imputation models at the same time using the function missForest. To later use the models learned on a training set for imputation of new observations, save_models needs to be set to TRUE.

By default feedback on the number of iterations and the error monitoring is provided. You can set verbose = FALSE to silence this output. More information on error monitoring is provided in a separate vignette.

Please note that in all following examples we set the ranger parameter num.threads to 2. If you are running the code on a machine with more cores and you are willing to use them for running the code, you can remove this parameter completely.

set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, 
                                                       save_models = TRUE, 
                                                       num.threads = 2)
#> Imputation sequence (missing proportion):  Sepal.Length (0.1) Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.1791016514, 0.1136759079, 0.205465453, 0.0486713185, 0.0307028438
#>     OOB errors NMSE:           0.2603478624, 0.6149609322, 0.0688856683, 0.0819382466, 0.1383677864
#>     diff. convergence measure: 0.7670999008
#>     time:                      0.13 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.136743901, 0.1013522874, 0.0916661356, 0.0410223856, 0.0293914105
#>     OOB errors NMSE:           0.198775288, 0.5482929344, 0.0307325778, 0.0690612551, 0.1324575808
#>     diff. convergence measure: 0.037036172
#>     time:                      0.14 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.1411053193, 0.1021741354, 0.086530287, 0.043991304, 0.0292682927
#>     OOB errors NMSE:           0.2051151843, 0.552738946, 0.0290107001, 0.0740594344, 0.1319027285
#>     diff. convergence measure: -0.0027014715
#>     time:                      0.12 seconds

The imputed training set can be found by extracting ximp dataframe from the object.

iris_train_imp <- iris_train_imp_object$ximp

head(iris_train_imp)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 2      4.900000    3.313362          1.4    0.200000  setosa
#> 4      4.600000    3.100000          1.5    0.200000  setosa
#> 5      5.000000    3.600000          1.4    0.256828  setosa
#> 8      5.000000    3.400000          1.5    0.200000  setosa
#> 9      4.400000    2.900000          1.4    0.200000  setosa
#> 10     4.774814    3.100000          1.5    0.100000  setosa

We will further impute the test set using the learned imputation models. The function missForestPredict will:

initialize the missing values in each variable with the initialization “learned” from the training set (mean/mode)
imperatively predict the missing values of each variable using the learned random forest models for each iteration

iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, 
                                                      newdata = iris_test_miss)

head(iris_test_imp)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 55           6.5     2.80000      4.60000   1.5000000 versicolor
#> 75           6.4     2.90000      4.42546   1.3000000 versicolor
#> 6            5.4     3.90000      1.70000   0.4000000     setosa
#> 123          7.7     2.80000      6.70000   2.0000000  virginica
#> 14           4.3     3.13961      1.10000   0.2106567     setosa
#> 7            4.6     3.40000      1.40000   0.3000000     setosa

Imputation of a single new observation

missForestPredict can impute a new observation with missing values. The new observation has to be provided as a dataframe with one row and with named columns (the column names have to correspond to the column names of the training set).

single_observation <- iris_test_miss[1,]
single_observation[1,2] <- NA

print(single_observation)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 55          6.5          NA          4.6         1.5 versicolor

single_observation_imp <- missForestPredict::missForestPredict(iris_train_imp_object, 
                                                               newdata = single_observation)

print(single_observation_imp)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 55          6.5     3.00739          4.6         1.5 versicolor

missForestPredict package can impute observations with new missingness patterns not present in the training data as well as variables with no missingness (complete) in training data. The initialization and the random forest imputation models are “learned” for all variables in the dataset, regardless of the amount of missingness.

Predict without storing the imputed matrix

The function missForest returns both the imputed training dataframe as well as the imputation models.

str(iris_train_imp_object, max.level = 1)
#> List of 11
#>  $ ximp                     :'data.frame':   100 obs. of  5 variables:
#>  $ init                     :List of 5
#>  $ initialization           : chr "mean/mode"
#>  $ impute_sequence          : chr [1:5] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" ...
#>  $ maxiter                  : num 10
#>  $ save_models              : logi TRUE
#>  $ models                   :List of 3
#>  $ return_integer_as_integer: logi FALSE
#>  $ integer_columns          : chr(0) 
#>  $ predictor_matrix         : num [1:5, 1:5] 0 1 1 1 1 1 0 1 1 1 ...
#>   ..- attr(*, "dimnames")=List of 2
#>  $ OOB_err                  :'data.frame':   55 obs. of  7 variables:
#>  - attr(*, "class")= chr "missForest"

The imputed training data is though not necessary for imputing the test set. To avoid storing further these data in the object, ximp can be set to NULL.

iris_train_imp <- iris_train_imp_object$ximp
iris_train_imp_object$ximp <- NULL 

iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, 
                                                      newdata = iris_test_miss)

head(iris_test_imp)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 55           6.5     2.80000      4.60000   1.5000000 versicolor
#> 75           6.4     2.90000      4.42546   1.3000000 versicolor
#> 6            5.4     3.90000      1.70000   0.4000000     setosa
#> 123          7.7     2.80000      6.70000   2.0000000  virginica
#> 14           4.3     3.13961      1.10000   0.2106567     setosa
#> 7            4.6     3.40000      1.40000   0.3000000     setosa

In ximp is set to NULL by mistake, and the training set imputation is lost, it can be recovered without rerunning the algorithm. Imputing the training set with the function missForestPredict will give the same results as the initial results of missForest function because the same models are applied to the variables in the same order and for the same number of iterations.

set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, 
                                                       save_models = TRUE,
                                                       num.threads = 2)
#> Imputation sequence (missing proportion):  Sepal.Length (0.1) Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.1791016514, 0.1136759079, 0.205465453, 0.0486713185, 0.0307028438
#>     OOB errors NMSE:           0.2603478624, 0.6149609322, 0.0688856683, 0.0819382466, 0.1383677864
#>     diff. convergence measure: 0.7670999008
#>     time:                      0.14 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.136743901, 0.1013522874, 0.0916661356, 0.0410223856, 0.0293914105
#>     OOB errors NMSE:           0.198775288, 0.5482929344, 0.0307325778, 0.0690612551, 0.1324575808
#>     diff. convergence measure: 0.037036172
#>     time:                      0.26 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.1411053193, 0.1021741354, 0.086530287, 0.043991304, 0.0292682927
#>     OOB errors NMSE:           0.2051151843, 0.552738946, 0.0290107001, 0.0740594344, 0.1319027285
#>     diff. convergence measure: -0.0027014715
#>     time:                      0.13 seconds

# store imputed dataframe
iris_train_imp <- iris_train_imp_object$ximp
iris_train_imp_object$ximp <- NULL 

# re-impute the same dataframe using missForestPredict
iris_train_imp_2 <-  missForestPredict::missForestPredict(iris_train_imp_object, 
                                                      newdata = iris_train_miss)

identical(iris_train_imp, iris_train_imp_2)
#> [1] TRUE

Impute larger datasets by adapting `num.trees`

Although missForestPredict benefits of the improved computation time of ranger package, larger dataset can still prove time consuming to impute.

We will load the diamonds dataset, which contains more than 50000 observations and produce 30% missing values on each variable.

library(ggplot2)

data(diamonds)

# split train / test
N <- nrow(diamonds)
n_test <- floor(N/3)

set.seed(2022)
id_test <- sample(1:N, n_test)

diamonds_train <- diamonds[-id_test,]
diamonds_test <- diamonds[id_test,]

diamonds_train_miss <- produce_NA(diamonds_train, proportion = 0.1)
diamonds_test_miss <- produce_NA(diamonds_test, proportion = 0.1)

head(diamonds_train_miss)
#> # A tibble: 6 x 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.23 Ideal     <NA>  SI2      61.5    55   326  3.95  3.98  2.43
#> 2  0.21 Premium   E     <NA>     59.8    61   326  3.89  3.84  2.31
#> 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#> 4  0.29 Premium   I     <NA>     62.4    58   334  4.2   4.23  2.63
#> 5  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#> 6  0.24 Very Good I     VVS1     NA      57   336  3.95  3.98  2.47
head(diamonds_test_miss)
#> # A tibble: 6 x 10
#>   carat cut     color clarity depth table price     x     y     z
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.33 Ideal   <NA>  <NA>     61.9  57    1312  4.43  4.46  2.75
#> 2  0.3  Ideal   H     SI1      62.6  53.1   475 NA     4.3   2.69
#> 3  0.4  Ideal   I     VVS2     62    56    1240  4.74  4.77  2.95
#> 4  0.73 Ideal   F     SI1      61.7  55    3249  5.79  5.82  3.58
#> 5  0.3  Premium E     SI2      NA    58     540  4.31  4.28  2.66
#> 6  1.01 Fair    F     VS2      64.8  NA    4791  6.3   6.25  4.07

The function missForest supports additional parameters to be passed to the ranger function. By default, the default values of ranger are used (e.g. the number of trees in the forest is 500). This can be overridden by passing num.trees = 100. Using less trees will prove to be computationally more efficient.

set.seed(2022)
diamonds_train_imp_object <- missForestPredict::missForest(diamonds_train_miss,
                                                           save_models = TRUE,
                                                           num.trees = 100,
                                                           num.threads = 2)
#> Imputation sequence (missing proportion):  carat (0.1) cut (0.1) color (0.1) clarity (0.1) depth (0.1) table (0.1) price (0.1) x (0.1) y (0.1) z (0.1) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.0007423176, 0.0755630292, 0.0981515212, 0.0683257445, 0.3921561624, 2.3368483435, 351214.440378606, 0.0058204377, 0.0243012546, 0.0073783223
#>     OOB errors NMSE:           0.0033187873, 0.5276536603, 0.8186435321, 0.6643940367, 0.1907520008, 0.4713497545, 0.0219643927, 0.0046172615, 0.019334449, 0.0153009953
#>     diff. convergence measure: 0.726267113
#>     time:                      30.11 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.0003850486, 0.0638179803, 0.0886416713, 0.0602309638, 0.1824754476, 2.2729623355, 316767.391799821, 0.0043671144, 0.0239301468, 0.007432586
#>     OOB errors NMSE:           0.001721493, 0.4456384457, 0.7393255851, 0.5856810408, 0.0887594282, 0.4584637432, 0.019810129, 0.0034643631, 0.0190391899, 0.0154135261
#>     diff. convergence measure: 0.0360011926
#>     time:                      34.66 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.0003799471, 0.0630639581, 0.0877946221, 0.0598951923, 0.1867112833, 2.2598167222, 320242.830123744, 0.0056301098, 0.0252573289, 0.0073136525
#>     OOB errors NMSE:           0.0016986846, 0.4403731386, 0.7322606782, 0.5824160259, 0.0908198169, 0.4558122311, 0.0200274774, 0.0044662774, 0.0200951162, 0.0151668847
#>     diff. convergence measure: 0.0014180613
#>     time:                      34.78 seconds
#> 
#>   missForest iteration 4 in progress...done!
#>     OOB errors MSE:            0.0003519875, 0.0628745507, 0.0878521041, 0.0601606389, 0.1861817272, 2.2590258107, 320065.558576817, 0.0045284449, 0.0243025288, 0.0075103301
#>     OOB errors NMSE:           0.0015736818, 0.4390505143, 0.7327401137, 0.5849972073, 0.090562231, 0.4556527018, 0.0200163911, 0.003592344, 0.0193354627, 0.0155747501
#>     diff. convergence measure: 4.0933e-06
#>     time:                      34.17 seconds
#> 
#>   missForest iteration 5 in progress...done!
#>     OOB errors MSE:            0.0003756882, 0.0630494168, 0.0879549408, 0.0599350225, 0.189900218, 2.2572089337, 319680.371423409, 0.0046050811, 0.0244964369, 0.0076207539
#>     OOB errors NMSE:           0.0016796439, 0.4402715969, 0.7335978339, 0.5828033312, 0.0923709736, 0.4552862319, 0.0199923021, 0.0036531383, 0.019489739, 0.0158037443
#>     diff. convergence measure: -0.0001853137
#>     time:                      35.34 seconds

# impute test set
diamonds_train_imp_object$ximp <- NULL 
diamonds_test_imp <- missForestPredict::missForestPredict(diamonds_train_imp_object,
                                                          newdata = diamonds_test_miss)

head(diamonds_test_imp)
#> # A tibble: 6 x 10
#>   carat cut     color clarity depth table price     x     y     z
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  0.33 Ideal   E     IF       61.9  57    1312  4.43  4.46  2.75
#> 2  0.3  Ideal   H     SI1      62.6  53.1   475  4.28  4.3   2.69
#> 3  0.4  Ideal   I     VVS2     62    56    1240  4.74  4.77  2.95
#> 4  0.73 Ideal   F     SI1      61.7  55    3249  5.79  5.82  3.58
#> 5  0.3  Premium E     SI2      62.1  58     540  4.31  4.28  2.66
#> 6  1.01 Fair    F     VS2      64.8  56.8  4791  6.3   6.25  4.07

Alternatively, the maxiter parameter can be set to a lower number (the default is 10) or other ranger parameters can be adapted. Note that not all parameters to ranger function are supported. Parameters that should be adapted in function of the imputed variable (outcome for the ranger function) are not supported. Generic parameters that can be applied to all variables are supported (like: num.trees, mtry, min.node.size, max.depth, replace, …), as well as class.weights for factor variables.

Custom initialization

The parameter initialization supports three values: mean/mode, median/mode and custom. The default is mean/mode, which will initialize each variable with the mean (for continuous variables) or mode (for categorical variables) calculated based on the complete observations on that variable.

When custom is used, a complete dataframe is expected in x_init. For example, the imputations of another imputation method can be used as initialization. When custom is used in missForest function, an initialization dataframe has to be passed to the missForestPredict function later on too.

We will exemplify this with an initialization using linear models on the iris dataset. Let’s assume that variables Sepal.Width and Species are known to be never missing. We will regress the other variables on these two variables and save the linear models to be applied on the test set afterwards.

data(iris)

# split train / test
N <- nrow(iris)
n_test <- floor(N/3)

set.seed(2022)
id_test <- sample(1:N, n_test)

iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]

# produce missing values
set.seed(2022)
iris_train_miss <- produce_NA(iris_train, proportion = c(0.2, 0, 0.2, 0.2, 0))
iris_test_miss <- produce_NA(iris_test, proportion = c(0.2, 0, 0.2, 0.2, 0))

# build linear models for Sepal.Length, Petal.Width, Petal.Length using complete cases for each variable
fit_1 <- lm(Sepal.Length ~ ., data = iris_train_miss[!is.na(iris_train_miss$Sepal.Length), 
                                                     c("Sepal.Length", "Sepal.Width", "Species")])
fit_2 <- lm(Petal.Width ~ ., data = iris_train_miss[!is.na(iris_train_miss$Petal.Width), 
                                                    c("Petal.Width", "Sepal.Width", "Species")])
fit_3 <- lm(Petal.Length ~ ., data = iris_train_miss[!is.na(iris_train_miss$Petal.Length), 
                                                     c("Petal.Length", "Sepal.Width", "Species")])

# impute training with predictions of linear model
iris_train_init <- iris_train_miss
iris_train_init$Sepal.Length[is.na(iris_train_init$Sepal.Length)] <- 
  predict(fit_1, iris_train_init[is.na(iris_train_init$Sepal.Length), c("Sepal.Width", "Species")])
iris_train_init$Petal.Width[is.na(iris_train_init$Petal.Width)] <- 
  predict(fit_2, iris_train_init[is.na(iris_train_init$Petal.Width), c("Sepal.Width", "Species")])
iris_train_init$Petal.Length[is.na(iris_train_init$Petal.Length)] <- 
  predict(fit_3, iris_train_init[is.na(iris_train_init$Petal.Length), c("Sepal.Width", "Species")])

# impute the training set using this initialization
set.seed(2022)
iris_train_imp_obj <- missForest(iris_train_miss, 
                                 save_models = TRUE,
                                 initialization = "custom", 
                                 x_init = iris_train_init,
                                 num.threads = 2)
#> Imputation sequence (missing proportion):  Sepal.Width (0) Species (0) Sepal.Length (0.2) Petal.Length (0.2) Petal.Width (0.2) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.0729457458, 0.023224213, 0.164375904, 0.1299985753, 0.0364212886
#>     OOB errors NMSE:           0.4068272086, 0.104550779, 0.2563766732, 0.0430637102, 0.0632384372
#>     diff. convergence measure: 0.8791070598
#>     time:                      0.11 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.0889019494, 0.0286355472, 0.1397322375, 0.0891585948, 0.0356435945
#>     OOB errors NMSE:           0.4958168777, 0.1289115272, 0.2179400101, 0.0295349382, 0.0618881236
#>     diff. convergence measure: 0.0177719163
#>     time:                      0.13 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.0951034048, 0.0278894294, 0.1464458736, 0.0923955305, 0.0354256936
#>     OOB errors NMSE:           0.5304031411, 0.1255526535, 0.228411251, 0.0306072151, 0.0615097813
#>     diff. convergence measure: -0.0037217251
#>     time:                      0.11 seconds

# build test set initialization using the linear models learned on training
iris_test_init <- iris_test_miss
iris_test_init$Sepal.Length[is.na(iris_test_init$Sepal.Length)] <- 
  predict(fit_1, iris_test_init[is.na(iris_test_init$Sepal.Length), c("Sepal.Width", "Species")])
iris_test_init$Petal.Width[is.na(iris_test_init$Petal.Width)] <- 
  predict(fit_2, iris_test_init[is.na(iris_test_init$Petal.Width), c("Sepal.Width", "Species")])
iris_test_init$Petal.Length[is.na(iris_test_init$Petal.Length)] <- 
  predict(fit_3, iris_test_init[is.na(iris_test_init$Petal.Length), c("Sepal.Width", "Species")])

# impute test set
iris_test_imp <- missForestPredict(iris_train_imp_obj, newdata = iris_test_miss, 
                                   x_init = iris_test_init)

evaluate_imputation_error(iris_test_imp, iris_test_miss, iris_test)
#>       variable        MSE       NMSE MER macro_F1 F1_score
#> 1 Petal.Length 0.03452000 0.01345075  NA       NA       NA
#> 2  Petal.Width 0.04080528 0.12767611  NA       NA       NA
#> 3 Sepal.Length 0.08041247 0.16808626  NA       NA       NA
#> 4  Sepal.Width 0.00000000 0.00000000  NA       NA       NA
#> 5      Species         NA         NA   0        0        0
evaluate_imputation_error(iris_test_init, iris_test_miss, iris_test)
#>       variable        MSE       NMSE MER macro_F1 F1_score
#> 1 Petal.Length 0.07731568 0.03012612  NA       NA       NA
#> 2  Petal.Width 0.03350469 0.10483320  NA       NA       NA
#> 3 Sepal.Length 0.10322009 0.21576106  NA       NA       NA
#> 4  Sepal.Width 0.00000000 0.00000000  NA       NA       NA
#> 5      Species         NA         NA   0        0        0

The errors (MSE, NMSE) for Sepal.Length and Petal.Length decreased compared to the initialization imputation. The missclasification error (MER) for Species is 0 (if a variable does not have missing values, the error returned will be 0).

Also keep in mind that x_init should be complete dataframe, if used. For example, if you would wish to impute only one variable using a different method as initialization, you cannot rely on the missForest function to do mean/mode initialization for the other variables. You would have to do it yourself before providing x_init.

Using `predictor_matrix`

By default, missForest will create models for all variables in a dataframe and will use all other variables in the imputation of each variable. Controlling which variables to impute and which predictors to use for the imputation of each variable can be done using the parameter predictor_matrix. We illustrate three different cases:

variables not imputed and not used as predictor
variables not imputed but used as predictor
variables imputed but not used as predictor

Variables not imputed and not used as predictor

We consider the case when a date variable is kept in the dataframe for further reference. Such variable does not need imputation, nor can be used as predictor.

data(iris)

iris$Date_collected <- seq(Sys.Date() - nrow(iris) + 1, Sys.Date(), by="days")

# split train / test
N <- nrow(iris)
n_test <- floor(N/3)

set.seed(2022)
id_test <- sample(1:N, n_test)

iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]

iris_train_miss <- produce_NA(iris_train, proportion = c(0.1,0.1,0.1,0.1,0.1,0))
iris_test_miss <- produce_NA(iris_test, proportion = c(0.1,0.1,0.1,0.1,0.1,0))

head(iris_train_miss)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Date_collected
#> 2           4.9         3.0          1.4         0.2  setosa     2023-07-16
#> 4           4.6         3.1          1.5         0.2  setosa     2023-07-18
#> 5           5.0         3.6          1.4         0.2  setosa     2023-07-19
#> 8           5.0         3.4          1.5         0.2  setosa     2023-07-22
#> 9            NA         2.9          1.4         0.2  setosa     2023-07-23
#> 10          4.9          NA          1.5         0.1  setosa     2023-07-24

We will create a predictor matrix so that the Date_collected variable is not included in imputation and not included as predictor. We will initialize the predictor matrix using the function create_predictor_matrix. This is the default predictor matrix in which all variables are included for imputation and all other variables are used as predictors. The diagonal is 0 as one variable will not be used as predictor for its own imputation model.


predictor_matrix <- create_predictor_matrix(iris_train_miss)

print(predictor_matrix)
#>                Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length              0           1            1           1       1
#> Sepal.Width               1           0            1           1       1
#> Petal.Length              1           1            0           1       1
#> Petal.Width               1           1            1           0       1
#> Species                   1           1            1           1       0
#> Date_collected            1           1            1           1       1
#>                Date_collected
#> Sepal.Length                1
#> Sepal.Width                 1
#> Petal.Length                1
#> Petal.Width                 1
#> Species                     1
#> Date_collected              0

The rows control the variables to be imputed and the columns control the variables as predictors. We will set all zeros for Date_collected both row-wise and column-wise.


predictor_matrix["Date_collected",] <- 0
predictor_matrix[,"Date_collected"] <- 0

print(predictor_matrix)
#>                Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length              0           1            1           1       1
#> Sepal.Width               1           0            1           1       1
#> Petal.Length              1           1            0           1       1
#> Petal.Width               1           1            1           0       1
#> Species                   1           1            1           1       0
#> Date_collected            0           0            0           0       0
#>                Date_collected
#> Sepal.Length                0
#> Sepal.Width                 0
#> Petal.Length                0
#> Petal.Width                 0
#> Species                     0
#> Date_collected              0

We will pass this matrix to the missForest function.


set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE, 
                                                       predictor_matrix = predictor_matrix,
                                                       verbose = TRUE,
                                                       num.threads = 2)
#> Variables skipped from imputation (missing proportion):  Date_collected (0) 
#> Imputation sequence (missing proportion):  Sepal.Length (0.1) Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.1341044569, 0.1040272501, 0.1722004553, 0.0452709963, 0.0283471652
#>     OOB errors NMSE:           0.1937151536, 0.5888499508, 0.0550696408, 0.0747929897, 0.1276095062
#>     diff. convergence measure: 0.7919925518
#>     time:                      0.12 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.1319444727, 0.0949470098, 0.0762417102, 0.0421574795, 0.0298083803
#>     OOB errors NMSE:           0.1905950361, 0.5374509276, 0.0243820702, 0.0696490951, 0.1341874102
#>     diff. convergence measure: 0.0167545404
#>     time:                      0.13 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.1283661665, 0.0956347547, 0.0736066676, 0.0421326132, 0.0303473639
#>     OOB errors NMSE:           0.1854261389, 0.541343932, 0.0235393846, 0.0696080131, 0.1366137353
#>     diff. convergence measure: -5.3333e-05
#>     time:                      0.14 seconds

iris_train_imp <- iris_train_imp_object$ximp

head(iris_train_imp)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species Date_collected
#> 2      4.900000    3.000000          1.4         0.2  setosa     2023-07-16
#> 4      4.600000    3.100000          1.5         0.2  setosa     2023-07-18
#> 5      5.000000    3.600000          1.4         0.2  setosa     2023-07-19
#> 8      5.000000    3.400000          1.5         0.2  setosa     2023-07-22
#> 9      4.687727    2.900000          1.4         0.2  setosa     2023-07-23
#> 10     4.900000    3.107115          1.5         0.1  setosa     2023-07-24

iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, 
                                                      newdata = iris_test_miss)

head(iris_test_imp)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Date_collected
#> 55      6.500000    2.800000     4.695463         1.5 versicolor     2023-09-07
#> 75      5.887097    2.900000     4.300000         1.3 versicolor     2023-09-27
#> 6       5.400000    3.606048     1.700000         0.4     setosa     2023-07-20
#> 123     7.700000    3.375697     6.700000         2.0  virginica     2023-11-14
#> 14      4.300000    3.000000     1.358195         0.1     setosa     2023-07-28
#> 7       4.600000    3.400000     1.400000         0.3     setosa     2023-07-21

Variables not imputed but used as predictor

We consider the case when we do not want to impute the variable Sepal.Length (for example, we do not plan to use it further in a prediction model), but we want to use it as predictor for the imputation of other variables.

data(iris)

# split train / test
N <- nrow(iris)
n_test <- floor(N/3)

set.seed(2022)
id_test <- sample(1:N, n_test)

iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]

iris_train_miss <- produce_NA(iris_train, proportion = 0.1)
iris_test_miss <- produce_NA(iris_test, proportion = 0.1)

head(iris_train_miss)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 2           4.9         3.0          1.4         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9            NA         2.9          1.4         0.2  setosa
#> 10          4.9          NA          1.5         0.1  setosa

We create the predictor matrix and set zeros row-wise. The row represents the variable to be imputed, the columns represent the predictors used in imputation. Setting zero row-wise means: do not use any of the predictors in imputing Sepal.Length. This situation is interpreted as: do not build imputation models for Sepal.Length.


predictor_matrix <- create_predictor_matrix(iris_train_miss)
predictor_matrix["Sepal.Length",] <- 0

print(predictor_matrix)
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length            0           0            0           0       0
#> Sepal.Width             1           0            1           1       1
#> Petal.Length            1           1            0           1       1
#> Petal.Width             1           1            1           0       1
#> Species                 1           1            1           1       0


set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE, 
                                                       predictor_matrix = predictor_matrix,
                                                       verbose = TRUE,
                                                       num.threads = 2)
#> Variables skipped from imputation (missing proportion):  Sepal.Length (0.1) 
#> Imputation sequence (missing proportion):  Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.1117129172, 0.1827309477, 0.0486749348, 0.0294848518
#>     OOB errors NMSE:           0.6323549429, 0.0584372883, 0.080416695, 0.1327309928
#>     diff. convergence measure: 0.7740150202
#>     time:                      0.09 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.106227064, 0.0768522608, 0.039277068, 0.0310534506
#>     OOB errors NMSE:           0.6013020759, 0.0245773241, 0.064890318, 0.1397923028
#>     diff. convergence measure: 0.0183444746
#>     time:                      0.11 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.1033715293, 0.0743138913, 0.0413143614, 0.0305141296
#>     OOB errors NMSE:           0.5851382198, 0.0237655545, 0.0682561654, 0.1373644589
#>     diff. convergence measure: 0.0040094055
#>     time:                      0.1 seconds
#> 
#>   missForest iteration 4 in progress...done!
#>     OOB errors MSE:            0.1059414378, 0.0725507058, 0.0407121795, 0.0289191544
#>     OOB errors NMSE:           0.5996852785, 0.0232016884, 0.0672612903, 0.1301844111
#>     diff. convergence measure: -0.0014520674
#>     time:                      0.11 seconds

iris_train_imp <- iris_train_imp_object$ximp

iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, 
                                                      newdata = iris_test_miss)

head(iris_test_imp)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 55      6.500000    2.800000     4.689202         1.5 versicolor
#> 75      5.848889    2.900000     4.300000         1.3 versicolor
#> 6       5.400000    3.599842     1.700000         0.4     setosa
#> 123     7.700000    3.364393     6.700000         2.0  virginica
#> 14      4.300000    3.000000     1.345240         0.1     setosa
#> 7       4.600000    3.400000     1.400000         0.3     setosa

Models are not built for the variable Sepal.Length. Nevertheless this variable is initialized (with mean imputation), as it is used in other imputation models. As this variable is not imputed iteratively, its mean imputation might not be optimal for it being used as predictor. We recommend using this case only for complete variables in the training data and no expected missing values in future data.

names(iris_train_imp_object$models[[1]])
#> [1] "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
names(iris_train_imp_object$init)
#> [1] "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"      "Sepal.Length"

Variables imputed but not used as predictor

A third case is that when we want to impute a variable (based on all other variables), but we don’t want to use it as predictor for specific variables. For example we do not want to use the Species variable in the imputation of Petal.Length and Petal.Width.

data(iris)

# split train / test
N <- nrow(iris)
n_test <- floor(N/3)

set.seed(2022)
id_test <- sample(1:N, n_test)

iris_train <- iris[-id_test,]
iris_test <- iris[id_test,]

iris_train_miss <- produce_NA(iris_train, proportion = 0.1)
iris_test_miss <- produce_NA(iris_test, proportion = 0.1)

head(iris_train_miss)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 2           4.9         3.0          1.4         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9            NA         2.9          1.4         0.2  setosa
#> 10          4.9          NA          1.5         0.1  setosa

predictor_matrix <- create_predictor_matrix(iris_train_miss)
predictor_matrix["Petal.Length","Species"] <- 0
predictor_matrix["Petal.Width","Species"] <- 0

print(predictor_matrix)
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length            0           1            1           1       1
#> Sepal.Width             1           0            1           1       1
#> Petal.Length            1           1            0           1       0
#> Petal.Width             1           1            1           0       0
#> Species                 1           1            1           1       0

set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_train_miss, save_models = TRUE, 
                                                       predictor_matrix = predictor_matrix,
                                                       verbose = TRUE,
                                                       num.threads = 2)
#> Imputation sequence (missing proportion):  Sepal.Length (0.1) Sepal.Width (0.1) Petal.Length (0.1) Petal.Width (0.1) Species (0.1) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.1341044569, 0.1040272501, 0.2179548815, 0.0606276203, 0.0315971841
#>     OOB errors NMSE:           0.1937151536, 0.5888499508, 0.0697018891, 0.1001639317, 0.14224001
#>     diff. convergence measure: 0.781065813
#>     time:                      0.14 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.1303642673, 0.0916584987, 0.1635653256, 0.0630372602, 0.033080005
#>     OOB errors NMSE:           0.1883124145, 0.5188361933, 0.0523081296, 0.1041449391, 0.1489151765
#>     diff. convergence measure: 0.0164308165
#>     time:                      0.12 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.1332057859, 0.0976123121, 0.1561038932, 0.0596154603, 0.0324109161
#>     OOB errors NMSE:           0.1924170148, 0.552537966, 0.0499219663, 0.0984917248, 0.1459031607
#>     diff. convergence measure: -0.0053509959
#>     time:                      0.16 seconds

iris_train_imp <- iris_train_imp_object$ximp

iris_test_imp <- missForestPredict::missForestPredict(iris_train_imp_object, 
                                                      newdata = iris_test_miss)

head(iris_test_imp)
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 55       6.50000    2.800000     4.824410         1.5 versicolor
#> 75       5.89133    2.900000     4.300000         1.3 versicolor
#> 6        5.40000    3.596408     1.700000         0.4     setosa
#> 123      7.70000    3.374780     6.700000         2.0  virginica
#> 14       4.30000    3.000000     1.380106         0.1     setosa
#> 7        4.60000    3.400000     1.400000         0.3     setosa

A predictor matrix can be checked using the function check_predictor_matrix. In case the predictor matrix is not valid, an error will be returned. For valid predictor matrices, human readable message about the imputation scheme will be printed.

check_predictor_matrix(predictor_matrix, iris_train)
#> GREAT! All checks passed succesfully.
#> Variable Sepal.Length will be imputed using all other variables.
#> Variable Sepal.Width will be imputed using all other variables.
#> Variable Petal.Length will be imputed using Sepal.Length, Sepal.Width, Petal.Width.
#> Variable Petal.Width will be imputed using Sepal.Length, Sepal.Width, Petal.Length.
#> Variable Species will be imputed using all other variables.

Impact of `proportion_usable_cases` on the predictor matrix

The default predictor matrix can be altered by the setting of proportion_usable_cases.

missForest initializes each variable and then builds models for each variable using the observed values of that variable as outcome of a random forest model. It then imputes the missing part of the variable using the learned models. Situations when a variable is mostly or completely missing for the observed part of another variable are suboptimal for learning the models, as the models will mostly rely on the initialization rather than on true values. Situations when a variable is missing for the missing part of another variable are suboptimal for the imputation (prediction) part as the predictions will rely on the initialization rather than on true values. By default proportion_usable_cases will filter the variables (as predictors) if they are completely missing for the observed or missing part of another variable.

As an example, we will make Sepal.Length and Sepal.Width in iris missing together and Petal.Length missing for the observed part of these variables. Checking the predictor matrix post imputation will reveal that these variables will not be used as predictor for each other because of the “sub-optimal” missingness pattern.


data(iris)

iris_mis <- iris
iris_mis[1:50, "Sepal.Length"] <- NA
iris_mis[1:50, "Sepal.Width"] <- NA
iris_mis[51:150, "Petal.Length"] <- NA

set.seed(2022)
iris_train_imp_object <- missForestPredict::missForest(iris_mis, save_models = TRUE, 
                                                       verbose = TRUE,
                                                       num.threads = 2)
#> Imputation sequence (missing proportion):  Petal.Width (0) Species (0) Sepal.Length (0.333333333333333) Sepal.Width (0.333333333333333) Petal.Length (0.666666666666667) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.030348177, 0.0289622723, 0.3109868031, 0.078778252, 0.0294038219
#>     OOB errors NMSE:           0.0525843832, 0.1303302253, 0.7149845113, 0.7186747556, 0.9948511947
#>     diff. convergence measure: 0.144159586
#>     time:                      0.09 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.0305443675, 0.0291665594, 0.315261906, 0.0790942654, 0.0295874733
#>     OOB errors NMSE:           0.0529243232, 0.1312495175, 0.7248133283, 0.7215576688, 1.0010648692
#>     diff. convergence measure: -0.0062847698
#>     time:                      0.11 seconds

print(iris_train_imp_object$predictor_matrix)
#>              Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> Sepal.Length            0           0            0           1       1
#> Sepal.Width             0           0            0           1       1
#> Petal.Length            0           0            0           1       1
#> Petal.Width             1           1            1           0       1
#> Species                 1           1            1           1       0

check_predictor_matrix(iris_train_imp_object$predictor_matrix, iris_mis, verbose = TRUE)
#> GREAT! All checks passed succesfully.
#> Variable Sepal.Length will be imputed using Petal.Width, Species.
#> Variable Sepal.Width will be imputed using Petal.Width, Species.
#> Variable Petal.Length will be imputed using Petal.Width, Species.
#> Variable Petal.Width will be imputed using all other variables.
#> Variable Species will be imputed using all other variables.

Using the missForestPredict package

Elena Albu

2023-12-11

Introduction

What is this document?

Package information

Package functionality

How to install

How to use the package

Data used for demonstration

Imputation of training and test set

Imputation of a single new observation

Predict without storing the imputed matrix

Impute larger datasets by adapting `num.trees`

Custom initialization

Using `predictor_matrix`

Variables not imputed and not used as predictor

Variables not imputed but used as predictor

Variables imputed but not used as predictor

Impact of `proportion_usable_cases` on the predictor matrix

Final words

References

Using the missForestPredict package

Elena Albu

2023-12-11

Introduction

What is this document?

Package information

Package functionality

How to install

How to use the package

Data used for demonstration

Imputation of training and test set

Imputation of a single new observation

Predict without storing the imputed matrix

Impute larger datasets by adapting num.trees

Custom initialization

Using predictor_matrix

Variables not imputed and not used as predictor

Variables not imputed but used as predictor

Variables imputed but not used as predictor

Impact of proportion_usable_cases on the predictor matrix

Final words

References

Impute larger datasets by adapting `num.trees`

Using `predictor_matrix`

Impact of `proportion_usable_cases` on the predictor matrix