Imputation of Missing Values

mlr provides several imputation methods, which are listed on the help page imputations. These include standard techniques such as imputation by a constant value (a fixed constant, the mean, the median or the mode) and by random numbers (drawn either from the empirical distribution of the feature under consideration or from a certain distribution family). Moreover, missing values in one feature can be replaced based on the other features by predictions from any supervised Learner integrated into mlr.

If your favourite option is not implemented in mlr yet, you can easily create your own imputation method.

Also note that some of the learning algorithms included in mlr can deal with missing values in a sensible way, i.e., other than simply deleting observations with missing values. Those Learners have the property "missings" and thus can be identified using listLearners.

## Regression learners that can deal with missing values
listLearners("regr", properties = "missings")[c("class", "package")]
#> Warning in listLearners.character("regr", properties = "missings"): The following learners could not be constructed, probably because their packages are not installed:
#> classif.hdrda,classif.mxff
#> Check ?learners to see which packages you need or install mlr with all suggestions.
#>              class      package
#> 1 regr.bartMachine  bartMachine
#> 2  regr.blackboost mboost,party
#> 3     regr.cforest        party
#> 4       regr.ctree        party
#> 5      regr.cubist       Cubist
#> 6 regr.featureless          mlr
#> ... (12 rows, 2 cols)

See also the list of integrated learners in the Appendix.

Imputation and reimputation

Imputation is done with the function impute. You can specify an imputation method for each feature individually or for classes of features like numerics or factors. Moreover, you can generate dummy variables that indicate which values are missing, again either for classes of features or for individual features. These make it possible to identify the patterns and reasons for missing data and to treat imputed and observed values differently in a subsequent analysis.

Let's have a look at the airquality data set.

data(airquality)
summary(airquality)
#>      Ozone           Solar.R           Wind             Temp      
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
#>  NA's   :37       NA's   :7                                       
#>      Month            Day      
#>  Min.   :5.000   Min.   : 1.0  
#>  1st Qu.:6.000   1st Qu.: 8.0  
#>  Median :7.000   Median :16.0  
#>  Mean   :6.993   Mean   :15.8  
#>  3rd Qu.:8.000   3rd Qu.:23.0  
#>  Max.   :9.000   Max.   :31.0  
#> 

There are 37 NA's in variable Ozone (ozone pollution) and 7 NA's in variable Solar.R (solar radiation). For demonstration purposes we insert artificial NA's in column Wind (wind speed) and coerce it into a factor.

airq = airquality
ind = sample(nrow(airq), 10)
airq$Wind[ind] = NA
airq$Wind = cut(airq$Wind, c(0,8,16,24))
summary(airq)
#>      Ozone           Solar.R           Wind         Temp      
#>  Min.   :  1.00   Min.   :  7.0   (0,8]  :51   Min.   :56.00  
#>  1st Qu.: 18.00   1st Qu.:115.8   (8,16] :86   1st Qu.:72.00  
#>  Median : 31.50   Median :205.0   (16,24]: 6   Median :79.00  
#>  Mean   : 42.13   Mean   :185.9   NA's   :10   Mean   :77.88  
#>  3rd Qu.: 63.25   3rd Qu.:258.8                3rd Qu.:85.00  
#>  Max.   :168.00   Max.   :334.0                Max.   :97.00  
#>  NA's   :37       NA's   :7                                   
#>      Month            Day      
#>  Min.   :5.000   Min.   : 1.0  
#>  1st Qu.:6.000   1st Qu.: 8.0  
#>  Median :7.000   Median :16.0  
#>  Mean   :6.993   Mean   :15.8  
#>  3rd Qu.:8.000   3rd Qu.:23.0  
#>  Max.   :9.000   Max.   :31.0  
#> 

If you want to impute NA's in all integer features (these include Ozone and Solar.R) by the mean, in all factor features (Wind) by the mode and additionally generate dummy variables for all integer features, you can do this as follows:

imp = impute(airq, classes = list(integer = imputeMean(), factor = imputeMode()),
  dummy.classes = "integer")

impute returns a list whose slot $data contains the imputed data set. By default, the dummy variables are factors with levels "TRUE" and "FALSE". It is also possible to create numeric zero-one indicator variables.

head(imp$data, 10)
#>       Ozone  Solar.R    Wind Temp Month Day Ozone.dummy Solar.R.dummy
#> 1  41.00000 190.0000   (0,8]   67     5   1       FALSE         FALSE
#> 2  36.00000 118.0000   (0,8]   72     5   2       FALSE         FALSE
#> 3  12.00000 149.0000  (8,16]   74     5   3       FALSE         FALSE
#> 4  18.00000 313.0000  (8,16]   62     5   4       FALSE         FALSE
#> 5  42.12931 185.9315  (8,16]   56     5   5        TRUE          TRUE
#> 6  28.00000 185.9315  (8,16]   66     5   6       FALSE          TRUE
#> 7  23.00000 299.0000  (8,16]   65     5   7       FALSE         FALSE
#> 8  19.00000  99.0000  (8,16]   59     5   8       FALSE         FALSE
#> 9   8.00000  19.0000 (16,24]   61     5   9       FALSE         FALSE
#> 10 42.12931 194.0000  (8,16]   69     5  10        TRUE         FALSE
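
To make the mechanics of the call above more concrete, the following base R sketch reproduces mean imputation with missingness indicators by hand on the original airquality data. The helper impute_mean_with_dummy is a hypothetical name introduced only for this illustration; it is not part of mlr and is not meant to mirror how mlr implements impute internally.

```r
# Base R sketch of mean imputation with missingness indicators.
# impute_mean_with_dummy() is a hypothetical helper, not an mlr function.
impute_mean_with_dummy = function(data, cols) {
  for (col in cols) {
    missing = is.na(data[[col]])
    # Record which values were missing before they are overwritten
    data[[paste0(col, ".dummy")]] = factor(missing, levels = c(FALSE, TRUE))
    # Replace the NAs by the mean of the observed values
    data[[col]][missing] = mean(data[[col]], na.rm = TRUE)
  }
  data
}

airq.manual = impute_mean_with_dummy(datasets::airquality, c("Ozone", "Solar.R"))
head(airq.manual[, c("Ozone", "Solar.R", "Ozone.dummy", "Solar.R.dummy")])
```

The imputed entries in rows 5, 6 and 10 should agree with the means shown in the output of impute above.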

Slot $desc is an ImputationDesc object that stores all relevant information about the imputation. For the current example this includes the means and the mode computed on the non-missing data.

imp$desc
#> Imputation description
#> Target: 
#> Features: 6; Imputed: 6
#> impute.new.levels: TRUE
#> recode.factor.levels: TRUE
#> dummy.type: factor

The imputation description shows the name of the target variable (not present), the number of features and the number of imputed features. Note that the latter number refers to the features for which an imputation method was specified (five integers plus one factor) and not to the features actually containing NA's. dummy.type indicates that the dummy variables are factors. For details on impute.new.levels and recode.factor.levels see the help page of function impute.
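
As a short sketch of the numeric alternative, the dummy.type argument of impute can be set to "numeric". The example assumes that mlr is installed and works on a fresh copy of airquality (called aq here, a name chosen only for this illustration) so that it does not interfere with the airq object used in the text.

```r
library(mlr)

# Fresh copy of the data; all columns with NAs (Ozone, Solar.R) are integers
aq = datasets::airquality

# Same mean imputation as before, but with numeric indicator columns
imp.num = impute(aq, classes = list(integer = imputeMean()),
  dummy.classes = "integer", dummy.type = "numeric")

# Inspect the type of the generated indicator columns
str(imp.num$data[, c("Ozone.dummy", "Solar.R.dummy")])
```

Numeric indicators can be convenient for learners that do not accept factor features, whereas factor dummies keep the indicator clearly separated from ordinary numeric features.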

Let's have a look at another example involving a target variable. A possible learning task associated with the airquality data is to predict the ozone pollution based on the meteorological features. Since we do not want to use the columns Day and Month, we remove them.

airq = subset(airq, select = 1:4)

The first 100 observations are used as the training data set, the remaining 53 as the test data set.

airq.train = airq[1:100,]
airq.test = airq[-c(1:100),]

In case of a supervised learning problem you need to pass the name of the target variable to impute. This prevents imputation and creation of a dummy variable for the target variable itself and makes sure that the target variable is not used to impute the features.

In contrast to the example above we specify imputation methods for individual features instead of classes of features.

Missing values in Solar.R are imputed by random numbers drawn from the empirical distribution of the non-missing observations.

Function imputeLearner allows you to use any supervised learning algorithm integrated into mlr for imputation. The type of the Learner (regr, classif) must correspond to the class of the feature to be imputed. The missing values in Wind are replaced by the predictions of a classification tree (rpart). By default, all available columns in airq.train except the target variable (Ozone) and the variable to be imputed (Wind) are used as features in the classification tree, here Solar.R and Temp. You can also select manually which columns to use. Note that rpart can deal with missing feature values, so the NA's in column Solar.R do not pose a problem.

imp = impute(airq.train, target = "Ozone", cols = list(Solar.R = imputeHist(),
  Wind = imputeLearner("classif.rpart")), dummy.cols = c("Solar.R", "Wind"))
summary(imp$data)
#>      Ozone           Solar.R            Wind         Temp      
#>  Min.   :  1.00   Min.   :  7.00   (0,8]  :34   Min.   :56.00  
#>  1st Qu.: 16.00   1st Qu.: 98.75   (8,16] :61   1st Qu.:69.00  
#>  Median : 34.00   Median :221.50   (16,24]: 5   Median :79.50  
#>  Mean   : 41.59   Mean   :191.54                Mean   :76.87  
#>  3rd Qu.: 63.00   3rd Qu.:274.25                3rd Qu.:84.00  
#>  Max.   :135.00   Max.   :334.00                Max.   :93.00  
#>  NA's   :31                                                    
#>  Solar.R.dummy Wind.dummy
#>  FALSE:93      FALSE:92  
#>  TRUE : 7      TRUE : 8  
#>                          
#>                          
#>                          
#>                          
#> 

imp$desc
#> Imputation description
#> Target: Ozone
#> Features: 3; Imputed: 2
#> impute.new.levels: TRUE
#> recode.factor.levels: TRUE
#> dummy.type: factor
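
The manual column selection mentioned above can be sketched with the features argument of imputeLearner. The example assumes that mlr and rpart are installed. It rebuilds a training set from scratch under the names aq and aq.train, which are chosen only for this illustration, and fixes a seed, so the rows with artificial NA's differ from those in the text.

```r
library(mlr)

# Rebuild a training set similar to the one in the text (illustrative names)
aq = datasets::airquality
set.seed(1)
aq$Wind[sample(nrow(aq), 10)] = NA
aq$Wind = cut(aq$Wind, c(0, 8, 16, 24))
aq.train = aq[1:100, 1:4]

# Impute Wind by a classification tree that only uses Temp as feature
imp.temp = impute(aq.train, target = "Ozone",
  cols = list(Wind = imputeLearner("classif.rpart", features = "Temp")))

# Number of missing values left in Wind after imputation
sum(is.na(imp.temp$data$Wind))
```

Restricting the features can be useful when some columns are themselves heavily affected by missing values or should not influence the imputation for substantive reasons.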

The ImputationDesc object can be used by function reimpute to impute the test data set the same way as the training data.

airq.test.imp = reimpute(airq.test, imp$desc)
head(airq.test.imp)
#>   Ozone Solar.R   Wind Temp Solar.R.dummy Wind.dummy
#> 1   110     207  (0,8]   90         FALSE      FALSE
#> 2    NA     222 (8,16]   92         FALSE      FALSE
#> 3    NA     137 (8,16]   86         FALSE      FALSE
#> 4    44     192 (8,16]   86         FALSE      FALSE
#> 5    28     273 (8,16]   82         FALSE      FALSE
#> 6    65     157 (8,16]   80         FALSE      FALSE
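
The idea behind reimpute, learning the imputation on the training data and reusing it unchanged on new data, can be illustrated with a small base R sketch. The helpers learn_mean and apply_mean as well as the toy data are made up for this illustration and are not part of mlr.

```r
# Base R sketch of the learn-then-apply pattern; both helpers are hypothetical.
learn_mean = function(train, col) {
  # Store everything needed to repeat the imputation later
  list(col = col, value = mean(train[[col]], na.rm = TRUE))
}

apply_mean = function(data, desc) {
  missing = is.na(data[[desc$col]])
  # Fill with the stored value instead of recomputing it on the new data
  data[[desc$col]][missing] = desc$value
  data
}

# Toy data, invented for this example
train = data.frame(x = c(1, 2, NA, 3))
test = data.frame(x = c(NA, 10))

desc = learn_mean(train, "x")
test.imp = apply_mean(test, desc)
test.imp$x
#> [1]  2 10
```

The missing test value is replaced by 2, the mean of the observed training values, and not by 10, the mean of the observed test values.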

Especially when evaluating a machine learning method by some resampling technique, you might want impute/reimpute to be called automatically each time before training/prediction. This can be achieved by creating an imputation wrapper.

Fusing a learner with imputation

You can couple a Learner with imputation by using the function makeImputeWrapper, which has essentially the same formal arguments as impute. As in the example above, we impute Solar.R by random numbers from its empirical distribution and Wind by the predictions of a classification tree, and we generate dummy variables for both features.

lrn = makeImputeWrapper("regr.lm", cols = list(Solar.R = imputeHist(),
  Wind = imputeLearner("classif.rpart")), dummy.cols = c("Solar.R", "Wind"))
lrn
#> Learner regr.lm.imputed from package stats
#> Type: regr
#> Name: ; Short name: 
#> Class: ImputeWrapper
#> Properties: numerics,factors,se,weights,missings
#> Predict-Type: response
#> Hyperparameters:

Before training the resulting Learner, impute is applied to the training set. Before prediction, reimpute is called on the test set with the ImputationDesc object from the training stage.
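
What this means during resampling can be sketched in base R: within each cross-validation fold, the replacement value is learned on the training part only and then reused on the held-out part. The function cv_mse_with_imputation, the fixed fold assignment and the use of plain mean imputation are assumptions made for this illustration. They do not reproduce the wrapper defined above or mlr's internals, so the resulting number is not comparable to results obtained with mlr.

```r
# Base R sketch: mean imputation learned inside each cross-validation fold.
# cv_mse_with_imputation() is a hypothetical helper, not an mlr function.
cv_mse_with_imputation = function(data, target, col, folds) {
  fold.mse = sapply(sort(unique(folds)), function(k) {
    train = data[folds != k, , drop = FALSE]
    test = data[folds == k, , drop = FALSE]
    # "impute" step: learn the replacement value on the training fold only
    fill = mean(train[[col]], na.rm = TRUE)
    train[[col]][is.na(train[[col]])] = fill
    # "reimpute" step: reuse the stored value on the held-out fold
    test[[col]][is.na(test[[col]])] = fill
    # Fit a linear model on all remaining columns and score the held-out fold
    fit = lm(reformulate(setdiff(names(train), target), response = target),
      data = train)
    mean((test[[target]] - predict(fit, newdata = test))^2)
  })
  mean(fold.mse)
}

aq = datasets::airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
aq = aq[!is.na(aq$Ozone), ]
folds = rep(1:3, length.out = nrow(aq))
cv.mse = cv_mse_with_imputation(aq, target = "Ozone", col = "Solar.R", folds = folds)
cv.mse
```

Learning the fill value inside each fold keeps information from the held-out observations out of the training stage, which is the main reason for coupling imputation and learner.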

We again aim to predict the ozone pollution from the meteorological variables. In order to create the Task we need to delete observations with missing values in the target variable.

airq = subset(airq, subset = !is.na(airq$Ozone))
task = makeRegrTask(data = airq, target = "Ozone")

Next, the 3-fold cross-validated mean squared error is calculated.

rdesc = makeResampleDesc("CV", iters = 3)
r = resample(lrn, task, resampling = rdesc, show.info = FALSE, models = TRUE)
r$aggr
#> mse.test.mean 
#>      524.3392
lapply(r$models, getLearnerModel, more.unwrap = TRUE)
#> [[1]]
#> 
#> Call:
#> stats::lm(formula = f, data = d)
#> 
#> Coefficients:
#>       (Intercept)            Solar.R         Wind(8,16]  
#>         -117.0954             0.0853           -27.6763  
#>       Wind(16,24]               Temp  Solar.R.dummyTRUE  
#>           -9.0988             2.0505           -27.4152  
#>    Wind.dummyTRUE  
#>            2.2535  
#> 
#> 
#> [[2]]
#> 
#> Call:
#> stats::lm(formula = f, data = d)
#> 
#> Coefficients:
#>       (Intercept)            Solar.R         Wind(8,16]  
#>         -94.84542            0.03936          -16.26255  
#>       Wind(16,24]               Temp  Solar.R.dummyTRUE  
#>          -7.00707            1.79513          -11.08578  
#>    Wind.dummyTRUE  
#>          -0.68340  
#> 
#> 
#> [[3]]
#> 
#> Call:
#> stats::lm(formula = f, data = d)
#> 
#> Coefficients:
#>       (Intercept)            Solar.R         Wind(8,16]  
#>         -57.30438            0.07426          -30.70737  
#>       Wind(16,24]               Temp  Solar.R.dummyTRUE  
#>         -18.25055            1.35898           -2.16654  
#>    Wind.dummyTRUE  
#>          -5.56400

A second possibility to fuse a learner with imputation is provided by makePreprocWrapperCaret, which is an interface to caret's preProcess function. preProcess only works for numeric features and offers imputation by k-nearest neighbors, bagged trees, and by the median.