Imputation of Missing Values
mlr provides several imputation methods which are listed on the help page imputations. These include standard techniques such as imputation by a constant value (a fixed constant, the mean, median or mode) and by random numbers (drawn either from the empirical distribution of the feature under consideration or from a certain distribution family). Moreover, missing values in one feature can be replaced based on the other features by predictions from any supervised Learner integrated into mlr.
Also note that some of the learning algorithms included in mlr can deal with missing values in a sensible way, i.e., other than simply deleting observations with missing values. Those Learners have the property "missings" and thus can be identified using listLearners.
## Regression learners that can deal with missing values
listLearners("regr", properties = "missings")[c("class", "package")]
#>              class      package
#> 1 regr.bartMachine  bartMachine
#> 2  regr.blackboost mboost,party
#> 3     regr.cforest        party
#> 4       regr.ctree        party
#> 5      regr.cubist       Cubist
#> 6 regr.featureless          mlr
#> ... (12 rows, 2 cols)
See also the list of integrated learners in the Appendix.
Imputation and reimputation
Imputation can be done by function impute. You can specify an imputation method for each feature individually or for classes of features like numerics or factors. Moreover, you can generate dummy variables that indicate which values are missing, also either for classes of features or for individual features. These dummy variables help to identify the patterns and reasons for missing data and make it possible to treat imputed and observed values differently in a subsequent analysis.
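To make the idea of class-wise imputation with indicator variables concrete, here is a minimal base R sketch. It does not call mlr: the helper `impute_mean_with_dummy` is a made-up name for this illustration, and impute may handle details differently.

```r
# Hypothetical helper (not part of mlr): replace NA's in integer columns by
# the column mean and add a factor "<col>.dummy" marking the imputed entries.
impute_mean_with_dummy = function(df) {
  for (col in names(df)) {
    if (is.integer(df[[col]]) && anyNA(df[[col]])) {
      miss = is.na(df[[col]])
      # store the missingness pattern before the NA's are overwritten
      df[[paste0(col, ".dummy")]] = factor(miss, levels = c(FALSE, TRUE))
      df[[col]][miss] = mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

data(airquality)
res = impute_mean_with_dummy(airquality)
colSums(is.na(res))      # number of remaining NA's per column
table(res$Ozone.dummy)   # observed vs. imputed Ozone values
```

Keeping the indicator next to the imputed column is what allows a later model to distinguish observed from filled-in values.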
Let's have a look at the airquality data set.
data(airquality)
summary(airquality)
#>      Ozone           Solar.R           Wind             Temp
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
#>  NA's   :37       NA's   :7
#>      Month            Day
#>  Min.   :5.000   Min.   : 1.0
#>  1st Qu.:6.000   1st Qu.: 8.0
#>  Median :7.000   Median :16.0
#>  Mean   :6.993   Mean   :15.8
#>  3rd Qu.:8.000   3rd Qu.:23.0
#>  Max.   :9.000   Max.   :31.0
#>
There are 37 NA's in variable Ozone (ozone pollution) and 7 NA's in variable Solar.R (solar radiation).
For demonstration purposes we insert artificial NA's in column Wind (wind speed) and coerce it into a factor.
airq = airquality
ind = sample(nrow(airq), 10)
airq$Wind[ind] = NA
airq$Wind = cut(airq$Wind, c(0,8,16,24))
summary(airq)
#>      Ozone           Solar.R           Wind         Temp
#>  Min.   :  1.00   Min.   :  7.0   (0,8]  :51   Min.   :56.00
#>  1st Qu.: 18.00   1st Qu.:115.8   (8,16] :86   1st Qu.:72.00
#>  Median : 31.50   Median :205.0   (16,24]: 6   Median :79.00
#>  Mean   : 42.13   Mean   :185.9   NA's   :10   Mean   :77.88
#>  3rd Qu.: 63.25   3rd Qu.:258.8                3rd Qu.:85.00
#>  Max.   :168.00   Max.   :334.0                Max.   :97.00
#>  NA's   :37       NA's   :7
#>      Month            Day
#>  Min.   :5.000   Min.   : 1.0
#>  1st Qu.:6.000   1st Qu.: 8.0
#>  Median :7.000   Median :16.0
#>  Mean   :6.993   Mean   :15.8
#>  3rd Qu.:8.000   3rd Qu.:23.0
#>  Max.   :9.000   Max.   :31.0
#>
If you want to impute NA's in all integer features (these include Ozone and Solar.R) by the mean, in all factor features (Wind) by the mode, and additionally generate dummy variables for all integer features, you can do this as follows:
imp = impute(airq, classes = list(integer = imputeMean(), factor = imputeMode()),
  dummy.classes = "integer")
impute returns a list where slot $data contains the imputed data set. Per default, the dummy variables are factors with levels "TRUE" and "FALSE". It is also possible to create numeric zero-one indicator variables.
head(imp$data, 10)
#>       Ozone  Solar.R    Wind Temp Month Day Ozone.dummy Solar.R.dummy
#> 1  41.00000 190.0000   (0,8]   67     5   1       FALSE         FALSE
#> 2  36.00000 118.0000   (0,8]   72     5   2       FALSE         FALSE
#> 3  12.00000 149.0000  (8,16]   74     5   3       FALSE         FALSE
#> 4  18.00000 313.0000  (8,16]   62     5   4       FALSE         FALSE
#> 5  42.12931 185.9315  (8,16]   56     5   5        TRUE          TRUE
#> 6  28.00000 185.9315  (8,16]   66     5   6       FALSE          TRUE
#> 7  23.00000 299.0000  (8,16]   65     5   7       FALSE         FALSE
#> 8  19.00000  99.0000  (8,16]   59     5   8       FALSE         FALSE
#> 9   8.00000  19.0000 (16,24]   61     5   9       FALSE         FALSE
#> 10 42.12931 194.0000  (8,16]   69     5  10        TRUE         FALSE
Slot $desc is an ImputationDesc object that stores all relevant information about the imputation. For the current example this includes the means and the mode computed on the non-missing data.
imp$desc
#> Imputation description
#> Target:
#> Features: 6; Imputed: 6
#> impute.new.levels: TRUE
#> recode.factor.levels: TRUE
#> dummy.type: factor
The imputation description shows the name of the target variable (not present), the number of features and the number of imputed features. Note that the latter number refers to the features for which an imputation method was specified (five integers plus one factor) and not to the features actually containing NA's. dummy.type indicates that the dummy variables are factors. For details on impute.new.levels and recode.factor.levels see the help page of function impute.
Let's have a look at another example involving a target variable.
A possible learning task associated with the airquality data is to predict the ozone
pollution based on the meteorological features.
Since we do not want to use columns Day and Month we remove them.
airq = subset(airq, select = 1:4)
The first 100 observations are used as training data set.
airq.train = airq[1:100,]
airq.test = airq[-c(1:100),]
In case of a supervised learning problem you need to pass the name of the target variable to impute. This prevents imputation and creation of a dummy variable for the target variable itself and makes sure that the target variable is not used to impute the features.
In contrast to the example above we specify imputation methods for individual features instead of classes of features.
Missing values in Solar.R are imputed by random numbers drawn from the empirical distribution of the non-missing observations.
Function imputeLearner allows you to use all supervised learning algorithms integrated into mlr for imputation. The type of the Learner (regr, classif) must correspond to the class of the feature to be imputed. The missing values in Wind are replaced by the predictions of a classification tree (rpart). Per default, all available columns in airq.train except the target variable (Ozone) and the variable to be imputed (Wind) are used as features in the classification tree, here Solar.R and Temp. You can also select manually which columns to use. Note that rpart can deal with missing feature values, therefore the NA's in column Solar.R do not pose a problem.
imp = impute(airq.train, target = "Ozone", cols = list(Solar.R = imputeHist(),
  Wind = imputeLearner("classif.rpart")), dummy.cols = c("Solar.R", "Wind"))
summary(imp$data)
#>      Ozone           Solar.R            Wind         Temp
#>  Min.   :  1.00   Min.   :  7.00   (0,8]  :34   Min.   :56.00
#>  1st Qu.: 16.00   1st Qu.: 98.75   (8,16] :61   1st Qu.:69.00
#>  Median : 34.00   Median :221.50   (16,24]: 5   Median :79.50
#>  Mean   : 41.59   Mean   :191.54                Mean   :76.87
#>  3rd Qu.: 63.00   3rd Qu.:274.25                3rd Qu.:84.00
#>  Max.   :135.00   Max.   :334.00                Max.   :93.00
#>  NA's   :31
#>  Solar.R.dummy Wind.dummy
#>  FALSE:93      FALSE:92
#>  TRUE : 7      TRUE : 8
#>
#>
#>
#>
#>
imp$desc
#> Imputation description
#> Target: Ozone
#> Features: 3; Imputed: 2
#> impute.new.levels: TRUE
#> recode.factor.levels: TRUE
#> dummy.type: factor
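The two imputation ideas used here can be sketched without mlr, using only base R and the rpart package. This is a rough illustration under our own assumptions (arbitrary seed, plain resampling of the observed values, formula chosen by hand); imputeHist and imputeLearner may differ in details such as binning and feature handling.

```r
library(rpart)

set.seed(1)                        # arbitrary seed, only for reproducibility
data(airquality)
d = airquality[, c("Solar.R", "Wind", "Temp")]
d$Wind[sample(nrow(d), 10)] = NA   # artificial NA's, as in the text
d$Wind = cut(d$Wind, c(0, 8, 16, 24))

# 1) Factor feature: fit a classification tree on the rows where Wind is
#    observed and predict the class for the rows where it is missing.
#    rpart tolerates the NA's that are still present in Solar.R.
miss.wind = is.na(d$Wind)
fit = rpart(Wind ~ Solar.R + Temp, data = d[!miss.wind, ], method = "class")
d$Wind[miss.wind] = predict(fit, newdata = d[miss.wind, ], type = "class")

# 2) Numeric feature: draw replacements from the observed values
miss.solar = is.na(d$Solar.R)
d$Solar.R[miss.solar] = sample(d$Solar.R[!miss.solar], sum(miss.solar),
  replace = TRUE)

colSums(is.na(d))                  # remaining NA's per column
```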
The ImputationDesc object can be used by function reimpute to impute the test data set in the same way as the training data.
airq.test.imp = reimpute(airq.test, imp$desc)
head(airq.test.imp)
#>   Ozone Solar.R   Wind Temp Solar.R.dummy Wind.dummy
#> 1   110     207  (0,8]   90         FALSE      FALSE
#> 2    NA     222 (8,16]   92         FALSE      FALSE
#> 3    NA     137 (8,16]   86         FALSE      FALSE
#> 4    44     192 (8,16]   86         FALSE      FALSE
#> 5    28     273 (8,16]   82         FALSE      FALSE
#> 6    65     157 (8,16]   80         FALSE      FALSE
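Why the imputation is described once and then reused can be illustrated with a small base R sketch: the replacement value is computed on the training data only and applied unchanged to new data. The functions `learn_mean_imputation` and `apply_mean_imputation` are hypothetical names for this illustration and are not mlr functions; the NA inserted into the test set is artificial.

```r
# Hypothetical helpers: "learn" stores the training means,
# "apply" fills NA's in any data set with these stored values.
learn_mean_imputation = function(train, cols) {
  sapply(train[cols], mean, na.rm = TRUE)
}
apply_mean_imputation = function(data, means) {
  for (col in names(means)) {
    data[[col]][is.na(data[[col]])] = means[[col]]
  }
  data
}

data(airquality)
train = airquality[1:100, ]
test = airquality[-(1:100), ]
test$Solar.R[1] = NA   # artificial NA so that the test set needs imputation

means = learn_mean_imputation(train, "Solar.R")
train.imp = apply_mean_imputation(train, means)
test.imp = apply_mean_imputation(test, means)   # filled with the training mean
```

Computing the mean again on the test set would give a different value and would let information from the test data leak into the preprocessing.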
Especially when evaluating a machine learning method by some resampling technique, you might want impute/reimpute to be called automatically each time before training/prediction. This can be achieved by creating an imputation wrapper.
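What such a wrapper automates can be sketched by hand in base R: within each cross-validation fold the imputation value is computed on the training part only and then applied to both parts before fitting and predicting. The fold assignment, the seed, and the combination of mean imputation with lm are assumptions made for this illustration; they are not a description of mlr's internals.

```r
set.seed(123)                                    # arbitrary seed
data(airquality)
d = airquality[!is.na(airquality$Ozone), c("Ozone", "Solar.R", "Wind", "Temp")]
folds = sample(rep(1:3, length.out = nrow(d)))   # random 3-fold assignment

mse = sapply(1:3, function(k) {
  train = d[folds != k, ]
  test = d[folds == k, ]
  # imputation value learned on the training part only
  m = mean(train$Solar.R, na.rm = TRUE)
  train$Solar.R[is.na(train$Solar.R)] = m
  test$Solar.R[is.na(test$Solar.R)] = m
  fit = lm(Ozone ~ Solar.R + Wind + Temp, data = train)
  mean((test$Ozone - predict(fit, newdata = test))^2)
})
mean(mse)                                        # cross-validated mean squared error
```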
Fusing a learner with imputation
You can couple a Learner with imputation by function makeImputeWrapper which basically has the same formal arguments as impute. As in the example above we impute Solar.R by random numbers from its empirical distribution, Wind by the predictions of a classification tree and generate dummy variables for both features.
lrn = makeImputeWrapper("regr.lm", cols = list(Solar.R = imputeHist(),
  Wind = imputeLearner("classif.rpart")), dummy.cols = c("Solar.R", "Wind"))
lrn
#> Learner regr.lm.imputed from package stats
#> Type: regr
#> Name: ; Short name:
#> Class: ImputeWrapper
#> Properties: numerics,factors,se,weights,missings
#> Predict-Type: response
#> Hyperparameters:
We again aim to predict the ozone pollution from the meteorological variables. In order to create the Task we need to delete observations with missing values in the target variable.
airq = subset(airq, subset = !is.na(airq$Ozone))
task = makeRegrTask(data = airq, target = "Ozone")
In the following the 3-fold cross-validated mean squared error is calculated.
rdesc = makeResampleDesc("CV", iters = 3)
r = resample(lrn, task, resampling = rdesc, show.info = FALSE, models = TRUE)
r$aggr
#> mse.test.mean
#>      524.3392
lapply(r$models, getLearnerModel, more.unwrap = TRUE)
#> [[1]]
#>
#> Call:
#> stats::lm(formula = f, data = d)
#>
#> Coefficients:
#>       (Intercept)            Solar.R         Wind(8,16]
#>         -117.0954             0.0853           -27.6763
#>       Wind(16,24]               Temp  Solar.R.dummyTRUE
#>           -9.0988             2.0505           -27.4152
#>    Wind.dummyTRUE
#>            2.2535
#>
#>
#> [[2]]
#>
#> Call:
#> stats::lm(formula = f, data = d)
#>
#> Coefficients:
#>       (Intercept)            Solar.R         Wind(8,16]
#>         -94.84542            0.03936          -16.26255
#>       Wind(16,24]               Temp  Solar.R.dummyTRUE
#>          -7.00707            1.79513          -11.08578
#>    Wind.dummyTRUE
#>          -0.68340
#>
#>
#> [[3]]
#>
#> Call:
#> stats::lm(formula = f, data = d)
#>
#> Coefficients:
#>       (Intercept)            Solar.R         Wind(8,16]
#>         -57.30438            0.07426          -30.70737
#>       Wind(16,24]               Temp  Solar.R.dummyTRUE
#>         -18.25055            1.35898           -2.16654
#>    Wind.dummyTRUE
#>          -5.56400
A second possibility to fuse a learner with imputation is provided by makePreprocWrapperCaret, which is an interface to caret's preProcess function. preProcess only works for numeric features and offers imputation by k-nearest neighbors, bagged trees, and by the median.