# Creating an Imputation Method

The function makeImputeMethod lets you create your own imputation method. To do so, you specify a learn function that extracts the necessary information and an impute function that performs the actual imputation. Both functions take at least the following formal arguments:

• data is a data.frame with missing values in some features.
• col indicates the feature to be imputed.
• target indicates the target variable(s) in a supervised learning task.
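To make the division of labor concrete, here is a minimal base-R sketch of how a learn and an impute function with these arguments could work together. The names learnMedian, imputeConst and applySketch are hypothetical and not part of mlr. In particular, applySketch only imitates the idea of learning on one data set and imputing another; it is an assumption for illustration, not mlr's internal mechanism.

```r
# Hypothetical learn function: extract the median of the observed values.
learnMedian = function(data, target, col) {
  list(const = median(data[[col]], na.rm = TRUE))
}

# Hypothetical impute function: fill missing entries with the learned value.
imputeConst = function(data, target, col, const) {
  x = data[[col]]
  replace(x, is.na(x), const)
}

# Hypothetical driver (NOT part of mlr): learn on `train`, then impute `new`.
applySketch = function(learn, impute, train, new, target, col) {
  learned = learn(data = train, target = target, col = col)
  do.call(impute, c(list(data = new, target = target, col = col), learned))
}

train = data.frame(x = c(1, NA, 3, 10), y = c(0, 1, 0, 1))
new = data.frame(x = c(NA, 5), y = c(1, 0))
filled = applySketch(learnMedian, imputeConst, train, new, target = "y", col = "x")
```

In this sketch the value learned from train (the median of 1, 3 and 10) is reused to fill the gap in new, which is the reason for keeping learning and imputing in separate functions.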

## Example: Imputation using the mean

Let's have a look at function imputeMean.

    imputeMean = function() {
      makeImputeMethod(learn = function(data, target, col) mean(data[[col]], na.rm = TRUE),
        impute = simpleImpute)
    }


imputeMean calls the unexported mlr function simpleImpute, which is defined as follows.

    simpleImpute = function(data, target, col, const) {
      if (is.na(const))
        stopf("Error imputing column '%s'. Maybe all input data was missing?", col)
      x = data[[col]]
      if (is.logical(x) && !is.logical(const)) {
        x = as.factor(x)
      }
      if (is.factor(x) && const %nin% levels(x)) {
        levels(x) = c(levels(x), as.character(const))
      }
      replace(x, is.na(x), const)
    }


The learn function calculates the mean of the non-missing observations in column col. The mean is passed via argument const to the impute function that replaces all missing values in feature col.
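The behavior of simpleImpute on numeric and factor columns can be tried out without loading mlr. The sketch below re-expresses it in base R under two assumptions: that stopf acts like stop(sprintf(...)) and that %nin% is the negation of %in% (both helpers come from the BBmisc package). The name simpleImputeSketch is hypothetical.

```r
# Base-R re-expression of simpleImpute (hypothetical name, for illustration).
simpleImputeSketch = function(data, target, col, const) {
  # Assumption: stopf(...) is equivalent to stop(sprintf(...)).
  if (is.na(const))
    stop(sprintf("Error imputing column '%s'. Maybe all input data was missing?", col))
  x = data[[col]]
  # A logical column filled with a non-logical constant becomes a factor.
  if (is.logical(x) && !is.logical(const)) {
    x = as.factor(x)
  }
  # Assumption: `a %nin% b` is equivalent to `!(a %in% b)`.
  # A constant that is not yet a factor level is added as a new level.
  if (is.factor(x) && !(const %in% levels(x))) {
    levels(x) = c(levels(x), as.character(const))
  }
  replace(x, is.na(x), const)
}

d = data.frame(num = c(1, NA, 3), fac = factor(c("a", NA, "b")))

# Numeric column: the learned mean of the observed values is the constant.
numFilled = simpleImputeSketch(d, target = character(0), col = "num",
  const = mean(d$num, na.rm = TRUE))

# Factor column: the constant is appended to the levels before replacing.
facFilled = simpleImputeSketch(d, target = character(0), col = "fac",
  const = "missing")
```

The factor branch matters because assigning a value that is not among the levels of a factor would otherwise produce NA again instead of the intended replacement.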

## Writing your own imputation method

Now let's write a new imputation method. A frequently used simple technique for longitudinal data is last observation carried forward (LOCF): missing values are replaced by the most recent observed value.

In the R code below, the learn function determines the last observed value before each run of NAs (values) as well as the number of consecutive NAs in that run (times). The impute function generates a vector by replicating the entries in values according to times and uses it to replace the NAs in feature col.

    imputeLOCF = function() {
      makeImputeMethod(
        learn = function(data, target, col) {
          x = data[[col]]
          ind = is.na(x)
          dind = diff(ind)
          lastValue = which(dind == 1)  # position of the last observed value previous to NA
          lastNA = which(dind == -1)    # position of the last of potentially several consecutive NA's
          values = x[lastValue]         # last observed value previous to NA
          times = lastNA - lastValue    # number of consecutive NA's
          return(list(values = values, times = times))
        },
        impute = function(data, target, col, values, times) {
          x = data[[col]]
          replace(x, is.na(x), rep(values, times))
        }
      )
    }
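The index arithmetic in the learn function is easier to follow on a small vector. The following base-R walk-through applies the same steps to a made-up example vector x (an assumption for illustration; it starts and ends with an observed value, which is the case the code handles).

```r
x = c(5, NA, NA, 7, NA, 9)

ind = is.na(x)                 # FALSE TRUE TRUE FALSE TRUE FALSE
dind = diff(ind)               # 1 0 -1 1 -1

# dind == 1 marks a step from observed to missing,
# dind == -1 marks a step from missing back to observed.
lastValue = which(dind == 1)   # 1 4: last observed position before each run
lastNA = which(dind == -1)     # 3 5: last missing position of each run

values = x[lastValue]          # 5 7
times = lastNA - lastValue     # 2 1: length of each run of NAs

# Repeat each carried-forward value once per missing entry in its run.
filled = replace(x, ind, rep(values, times))
```

Here filled is 5 5 5 7 7 9. The subtraction only yields run lengths when every run of NAs is enclosed by observed values on both sides, so that lastValue and lastNA have the same length and line up pairwise.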


Note that imputeLOCF is just for demonstration and lacks some checks needed for real-world usage (for example, what should happen if the first value in x is already missing?). Below it is used to impute the missing values in features Ozone and Solar.R in the airquality data set.

    data(airquality)
    imp = impute(airquality, cols = list(Ozone = imputeLOCF(), Solar.R = imputeLOCF()),
      dummy.cols = c("Ozone", "Solar.R"))
    head(imp$data, 10)
    #>    Ozone Solar.R Wind Temp Month Day Ozone.dummy Solar.R.dummy
    #> 1     41     190  7.4   67     5   1       FALSE         FALSE
    #> 2     36     118  8.0   72     5   2       FALSE         FALSE
    #> 3     12     149 12.6   74     5   3       FALSE         FALSE
    #> 4     18     313 11.5   62     5   4       FALSE         FALSE
    #> 5     18     313 14.3   56     5   5        TRUE          TRUE
    #> 6     28     313 14.9   66     5   6       FALSE          TRUE
    #> 7     23     299  8.6   65     5   7       FALSE         FALSE
    #> 8     19      99 13.8   59     5   8       FALSE         FALSE
    #> 9      8      19 20.1   61     5   9       FALSE         FALSE
    #> 10     8     194  8.6   69     5  10        TRUE         FALSE

## Complete code listing

The above code without the output is given below:

    imputeLOCF = function() {
      makeImputeMethod(
        learn = function(data, target, col) {
          x = data[[col]]
          ind = is.na(x)
          dind = diff(ind)
          lastValue = which(dind == 1)  # position of the last observed value previous to NA
          lastNA = which(dind == -1)    # position of the last of potentially several consecutive NA's
          values = x[lastValue]         # last observed value previous to NA
          times = lastNA - lastValue    # number of consecutive NA's
          return(list(values = values, times = times))
        },
        impute = function(data, target, col, values, times) {
          x = data[[col]]
          replace(x, is.na(x), rep(values, times))
        }
      )
    }
    data(airquality)
    imp = impute(airquality, cols = list(Ozone = imputeLOCF(), Solar.R = imputeLOCF()),
      dummy.cols = c("Ozone", "Solar.R"))
    head(imp$data, 10)
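As mentioned earlier, imputeLOCF does not cover edge cases such as a missing first value. One possible way to handle them is sketched below in base R. The helper locfSketch is hypothetical and not part of mlr; it operates on a plain vector, and the choice to leave leading NAs untouched is an assumption, since there is no earlier observation to carry forward.

```r
# Hypothetical helper (NOT part of mlr): last observation carried forward.
locfSketch = function(x) {
  obs = !is.na(x)
  # For each position, the index of the most recent observed value
  # (0 where no value has been observed yet).
  idx = cummax(seq_along(x) * obs)
  out = x
  fill = idx > 0
  out[fill] = x[idx[fill]]
  out
}

x = c(NA, 5, NA, NA, 7, NA)
filled = locfSketch(x)
```

In this example the leading NA is kept, the two NAs after 5 become 5, and the trailing NA becomes 7.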