Data Preprocessing

Data preprocessing refers to any transformation of the data done before applying a learning algorithm. Examples include finding and resolving inconsistencies, imputing missing values, identifying, removing or replacing outliers, discretizing numerical data, generating numerical dummy variables for categorical data, transformations such as standardization of predictors or Box-Cox, dimensionality reduction, and feature extraction and/or selection.

mlr offers several options for data preprocessing. Some of the following simple methods to change a Task (or data.frame) were already mentioned on the page about learning tasks:

Moreover, there are tutorial pages devoted to

Fusing learners with preprocessing

mlr's wrapper functionality lets you combine learners with preprocessing steps. This means that the preprocessing "belongs" to the learner and is carried out every time the learner is trained or used for prediction.

This is, on the one hand, very practical. You don't need to change any data or learning Tasks and it's quite easy to combine different learners with different preprocessing steps.

On the other hand, this helps to avoid a common mistake in evaluating the performance of a learner with preprocessing: preprocessing is often seen as completely independent of the learning algorithm applied afterwards. When the performance of a learner is estimated, e.g., by cross-validation, all preprocessing is then done beforehand on the full data set, and only training and predicting are done on the train and test sets. Depending on what exactly the preprocessing does, this can lead to overoptimistic results. For example, if imputation by the mean is done on the whole data set before evaluating the learner, information from the test data is used during training.

To clarify things, one should distinguish between data-dependent and data-independent preprocessing steps. Data-dependent steps learn from the data in some way and give different results when applied to different data sets. Data-independent steps always lead to the same results. Clearly, correcting errors in the data or removing columns such as IDs that should not be used for learning is data-independent. Imputation of missing values by the mean, as mentioned above, is data-dependent. Imputation by a fixed constant, however, is not.

To get an honest estimate of the performance of a learner combined with preprocessing, all data-dependent preprocessing steps must be included in the resampling. This is done automatically when fusing a learner with preprocessing.
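The difference between leaky and honest mean imputation can be sketched in base R. This is a minimal illustration on made-up toy data (the vector x and the split indices are assumptions for the example, not part of mlr): the imputation value is learned on the training part only and then reused for the test part.

```r
## Toy data with missing values (hypothetical, for illustration only)
set.seed(1)
x = c(rnorm(8, mean = 0), rnorm(4, mean = 10))
x[c(2, 10)] = NA
train.idx = 1:8
test.idx = 9:12

## Leaky: the mean is computed on all observations, including the test set
mean.all = mean(x, na.rm = TRUE)

## Honest: the mean is learned on the training set only ...
mean.train = mean(x[train.idx], na.rm = TRUE)

## ... and then applied to both the training and the test set
x.train = ifelse(is.na(x[train.idx]), mean.train, x[train.idx])
x.test = ifelse(is.na(x[test.idx]), mean.train, x[test.idx])

c(leaky = mean.all, honest = mean.train)
```

In a resampling loop the honest variant would be repeated for every training/test split, which is what a fused learner does for you.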

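To this end mlr provides two wrappers: makePreprocWrapperCaret, an interface to the preprocessing options offered by caret's preProcess function, and makePreprocWrapper, which lets you write your own preprocessing wrapper by defining what is done before training and before prediction.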

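As mentioned above, the specified preprocessing steps then "belong" to the wrapped Learner. In contrast to preprocessing functions like normalizeFeatures, which change the Task (or data.frame) itself, the Task remains unchanged, the preprocessing is carried out separately for each pair of training and test sets (for example in resampling) rather than once for the whole data set, and parameters that control the preprocessing can be tuned together with the parameters of the base learner.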

We start with some examples for makePreprocWrapperCaret.

Preprocessing with makePreprocWrapperCaret

makePreprocWrapperCaret is an interface to caret's preProcess function, which provides many options such as imputation of missing values, data transformations (for example scaling the features to a certain range or Box-Cox), and dimensionality reduction via Independent or Principal Component Analysis. For all available options see the help page of function preProcess.

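Note that the usage of makePreprocWrapperCaret differs slightly from that of preProcess. As the example below illustrates, the arguments of preProcess reappear with the prefix ppc., and there is no method argument: each preprocessing option that would be passed to preProcess via method is instead switched on by its own logical parameter.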

For example the following call to preProcess

preProcess(x, method = c("knnImpute", "pca"), pcaComp = 10)

with x being a matrix or data.frame would thus translate into

makePreprocWrapperCaret(learner, ppc.knnImpute = TRUE, ppc.pca = TRUE, ppc.pcaComp = 10)

where learner is an mlr Learner or the name of a learner class like "classif.lda".
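The naming scheme of this translation can be sketched with a small helper. Note that toPpcArgs is a hypothetical function written only for this illustration and is not part of mlr or caret; it merely builds the argument list in base R.

```r
## Hypothetical helper (not part of mlr): turn a preProcess-style call
## into the ppc.-prefixed arguments used by makePreprocWrapperCaret
toPpcArgs = function(method, ...) {
  ## Every entry of 'method' becomes a logical flag set to TRUE
  flags = setNames(as.list(rep(TRUE, length(method))), paste0("ppc.", method))
  ## All remaining arguments keep their value and get the prefix
  extra = list(...)
  if (length(extra) > 0)
    names(extra) = paste0("ppc.", names(extra))
  c(flags, extra)
}

args = toPpcArgs(c("knnImpute", "pca"), pcaComp = 10)
str(args)

## With mlr loaded, the list could be passed on like this (not run here):
## do.call(makePreprocWrapperCaret, c(list(learner = "classif.lda"), args))
```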

If you enable multiple preprocessing options (like knn imputation and principal component analysis above) these are executed in a certain order detailed on the help page of function preProcess.

In the following we show an example where principal component analysis (PCA) is used for dimensionality reduction. PCA should never be applied blindly, but it can be beneficial for learners that struggle with high-dimensional data or that profit from rotating the data.

We consider the sonar.task, which poses a binary classification problem with 208 observations and 60 features.

sonar.task
#> Supervised task: Sonar-example
#> Type: classif
#> Target: Class
#> Observations: 208
#> Features:
#> numerics  factors  ordered 
#>       60        0        0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 2
#>   M   R 
#> 111  97 
#> Positive class: M

Below we fuse quadratic discriminant analysis from package MASS with a principal components preprocessing step. The threshold is set to 0.9, i.e., the principal components necessary to explain a cumulative percentage of 90% of the total variance are kept. The data are automatically standardized prior to PCA.

lrn = makePreprocWrapperCaret("classif.qda", ppc.pca = TRUE, ppc.thresh = 0.9)
lrn
#> Learner classif.qda.preproc from package MASS
#> Type: classif
#> Name: ; Short name: 
#> Class: PreprocWrapperCaret
#> Properties: twoclass,multiclass,numerics,factors,prob
#> Predict-Type: response
#> Hyperparameters: ppc.BoxCox=FALSE,ppc.YeoJohnson=FALSE,ppc.expoTrans=FALSE,ppc.center=TRUE,ppc.scale=TRUE,ppc.range=FALSE,ppc.knnImpute=FALSE,ppc.bagImpute=FALSE,ppc.medianImpute=FALSE,ppc.pca=TRUE,ppc.ica=FALSE,ppc.spatialSign=FALSE,ppc.thresh=0.9,ppc.na.remove=TRUE,ppc.k=5,ppc.fudge=0.2,ppc.numUnique=3

The wrapped learner is trained on the sonar.task. By inspecting the underlying qda model, we see that the first 22 principal components have been used for training.

mod = train(lrn, sonar.task)
mod
#> Model for learner.id=classif.qda.preproc; learner.class=PreprocWrapperCaret
#> Trained on: task.id = Sonar-example; obs = 208; features = 60
#> Hyperparameters: ppc.BoxCox=FALSE,ppc.YeoJohnson=FALSE,ppc.expoTrans=FALSE,ppc.center=TRUE,ppc.scale=TRUE,ppc.range=FALSE,ppc.knnImpute=FALSE,ppc.bagImpute=FALSE,ppc.medianImpute=FALSE,ppc.pca=TRUE,ppc.ica=FALSE,ppc.spatialSign=FALSE,ppc.thresh=0.9,ppc.na.remove=TRUE,ppc.k=5,ppc.fudge=0.2,ppc.numUnique=3

getLearnerModel(mod)
#> Model for learner.id=classif.qda; learner.class=classif.qda
#> Trained on: task.id = Sonar-example; obs = 208; features = 22
#> Hyperparameters:

getLearnerModel(mod, more.unwrap = TRUE)
#> Call:
#> qda(f, data = getTaskData(.task, .subset, recode.target = "drop.levels"))
#> 
#> Prior probabilities of groups:
#>         M         R 
#> 0.5336538 0.4663462 
#> 
#> Group means:
#>          PC1        PC2        PC3         PC4         PC5         PC6
#> M  0.5976122 -0.8058235  0.9773518  0.03794232 -0.04568166 -0.06721702
#> R -0.6838655  0.9221279 -1.1184128 -0.04341853  0.05227489  0.07691845
#>          PC7         PC8        PC9       PC10        PC11          PC12
#> M  0.2278162 -0.01034406 -0.2530606 -0.1793157 -0.04084466 -0.0004789888
#> R -0.2606969  0.01183702  0.2895848  0.2051963  0.04673977  0.0005481212
#>          PC13       PC14        PC15        PC16        PC17        PC18
#> M -0.06138758 -0.1057137  0.02808048  0.05215865 -0.07453265  0.03869042
#> R  0.07024765  0.1209713 -0.03213333 -0.05968671  0.08528994 -0.04427460
#>          PC19         PC20        PC21         PC22
#> M -0.01192247  0.006098658  0.01263492 -0.001224809
#> R  0.01364323 -0.006978877 -0.01445851  0.001401586
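How a variance threshold translates into a number of principal components can be sketched with base R's prcomp. The snippet uses the built-in iris data purely as a stand-in (not the Sonar data), and the selection rule shown (smallest number of components whose cumulative variance reaches the threshold) is an assumption about the general idea; caret's implementation may differ in details.

```r
## Built-in data used as a stand-in; columns 1 to 4 are numeric features
x = iris[, 1:4]

## Standardize the features and run PCA
pca = prcomp(x, center = TRUE, scale. = TRUE)

## Cumulative proportion of the total variance explained
cum.var = cumsum(pca$sdev^2) / sum(pca$sdev^2)

## Smallest number of components reaching a threshold of 0.9
n.comp = which(cum.var >= 0.9)[1]
n.comp
```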

Below, the performance of qda with and without PCA preprocessing is compared in a benchmark experiment. Note that we use stratified resampling to prevent errors in qda caused by too few observations from either class.

rin = makeResampleInstance("CV", iters = 3, stratify = TRUE, task = sonar.task)
res = benchmark(list("classif.qda", lrn), sonar.task, rin, show.info = FALSE)
res
#>         task.id          learner.id mmce.test.mean
#> 1 Sonar-example         classif.qda      0.3941339
#> 2 Sonar-example classif.qda.preproc      0.2643202

In this case PCA preprocessing turns out to be clearly beneficial for Quadratic Discriminant Analysis: the mean misclassification error drops from about 0.39 to about 0.26.
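The idea behind the stratified resampling used above can be sketched in base R. This is only an illustration of stratified fold assignment with the Sonar class counts (111 and 97), not mlr's actual implementation.

```r
set.seed(1)
## Class labels with the same class counts as the Sonar task
y = factor(rep(c("M", "R"), times = c(111, 97)))

## Assign fold numbers 1 to 3 separately within each class,
## so every fold gets (almost) the same share of each class
folds = integer(length(y))
for (cl in levels(y)) {
  idx = which(y == cl)
  folds[idx] = sample(rep_len(1:3, length(idx)))
}

tab = table(y, folds)
tab
```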

Joint tuning of preprocessing options and learner parameters

Let's see if we can optimize this a bit. The threshold value of 0.9 above was chosen arbitrarily and led to 22 out of 60 principal components. But maybe a lower or higher number of principal components should be used. Moreover, qda has several options that control how the class covariance matrices or class probabilities are estimated.

Those preprocessing and learner parameters can be tuned jointly. Before doing this let's first get an overview of all the parameters of the wrapped learner using function getParamSet.

getParamSet(lrn)
#>                      Type len     Def                      Constr Req
#> ppc.BoxCox        logical   -   FALSE                           -   -
#> ppc.YeoJohnson    logical   -   FALSE                           -   -
#> ppc.expoTrans     logical   -   FALSE                           -   -
#> ppc.center        logical   -    TRUE                           -   -
#> ppc.scale         logical   -    TRUE                           -   -
#> ppc.range         logical   -   FALSE                           -   -
#> ppc.knnImpute     logical   -   FALSE                           -   -
#> ppc.bagImpute     logical   -   FALSE                           -   -
#> ppc.medianImpute  logical   -   FALSE                           -   -
#> ppc.pca           logical   -   FALSE                           -   -
#> ppc.ica           logical   -   FALSE                           -   -
#> ppc.spatialSign   logical   -   FALSE                           -   -
#> ppc.thresh        numeric   -    0.95                    0 to Inf   -
#> ppc.pcaComp       integer   -       -                    1 to Inf   -
#> ppc.na.remove     logical   -    TRUE                           -   -
#> ppc.k             integer   -       5                    1 to Inf   -
#> ppc.fudge         numeric   -     0.2                    0 to Inf   -
#> ppc.numUnique     integer   -       3                    1 to Inf   -
#> ppc.n.comp        integer   -       -                    1 to Inf   -
#> method           discrete   -  moment            moment,mle,mve,t   -
#> nu                numeric   -       5                    2 to Inf   Y
#> predict.method   discrete   - plug-in plug-in,predictive,debiased   -
#>                  Tunable Trafo
#> ppc.BoxCox          TRUE     -
#> ppc.YeoJohnson      TRUE     -
#> ppc.expoTrans       TRUE     -
#> ppc.center          TRUE     -
#> ppc.scale           TRUE     -
#> ppc.range           TRUE     -
#> ppc.knnImpute       TRUE     -
#> ppc.bagImpute       TRUE     -
#> ppc.medianImpute    TRUE     -
#> ppc.pca             TRUE     -
#> ppc.ica             TRUE     -
#> ppc.spatialSign     TRUE     -
#> ppc.thresh          TRUE     -
#> ppc.pcaComp         TRUE     -
#> ppc.na.remove       TRUE     -
#> ppc.k               TRUE     -
#> ppc.fudge           TRUE     -
#> ppc.numUnique       TRUE     -
#> ppc.n.comp          TRUE     -
#> method              TRUE     -
#> nu                  TRUE     -
#> predict.method      TRUE     -

The parameters prefixed by ppc. belong to preprocessing. method, nu and predict.method are qda parameters.

Instead of tuning the PCA threshold (ppc.thresh) we tune the number of principal components (ppc.pcaComp) directly. Moreover, for qda we try two different ways to estimate the posterior probabilities (parameter predict.method): the usual plug-in estimates and debiased estimates.

We perform a grid search and set the resolution to 10. This is for demonstration. You might want to use a finer resolution.

ps = makeParamSet(
  makeIntegerParam("ppc.pcaComp", lower = 1, upper = getTaskNFeats(sonar.task)),
  makeDiscreteParam("predict.method", values = c("plug-in", "debiased"))
)
ctrl = makeTuneControlGrid(resolution = 10)
res = tuneParams(lrn, sonar.task, rin, par.set = ps, control = ctrl, show.info = FALSE)
res
#> Tune result:
#> Op. pars: ppc.pcaComp=8; predict.method=plug-in
#> mmce.test.mean=0.192

as.data.frame(res$opt.path)[1:3]
#>    ppc.pcaComp predict.method mmce.test.mean
#> 1            1        plug-in      0.4757074
#> 2            8        plug-in      0.1920635
#> 3           14        plug-in      0.2162871
#> 4           21        plug-in      0.2643202
#> 5           27        plug-in      0.2454106
#> 6           34        plug-in      0.2645273
#> 7           40        plug-in      0.2742581
#> 8           47        plug-in      0.3173223
#> 9           53        plug-in      0.3512767
#> 10          60        plug-in      0.3941339
#> 11           1       debiased      0.5336094
#> 12           8       debiased      0.2450656
#> 13          14       debiased      0.2403037
#> 14          21       debiased      0.2546584
#> 15          27       debiased      0.3075224
#> 16          34       debiased      0.3172533
#> 17          40       debiased      0.3125604
#> 18          47       debiased      0.2979986
#> 19          53       debiased      0.3079365
#> 20          60       debiased      0.3654244

There seems to be a preference for a lower number of principal components (<27) for both "plug-in" and "debiased", with "plug-in" reaching the lowest error rate overall (0.192 with 8 components).
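The values of ppc.pcaComp in the table above are consistent with 10 equally spaced points between 1 and 60, rounded to integers, crossed with the two values of predict.method. The following base R sketch reproduces that grid; that the grid is formed exactly this way is an assumption based on the output shown.

```r
## 10 equally spaced values between the bounds, rounded to integers
comps = round(seq(1, 60, length.out = 10))
comps

## Cross them with the discrete parameter: 10 * 2 = 20 settings
grid = expand.grid(ppc.pcaComp = comps,
  predict.method = c("plug-in", "debiased"),
  stringsAsFactors = FALSE)
nrow(grid)
```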

Writing a custom preprocessing wrapper

If the options offered by makePreprocWrapperCaret are not enough, you can write your own preprocessing wrapper using function makePreprocWrapper.

As described in the tutorial section about wrapped learners, wrappers are implemented using a train and a predict method. In the case of preprocessing wrappers, these methods specify how to transform the data before training and before prediction and are completely user-defined.

Below we show how to create a preprocessing wrapper that centers and scales the data before training and predicting. Some learning methods, such as k nearest neighbors, support vector machines or neural networks, usually require scaled features. Many, but not all, have a built-in scaling option where the training data set is scaled before model fitting and the test data set is scaled accordingly, that is, by using the scaling parameters from the training stage, before making predictions. In the following we show how to add a scaling option to a Learner by coupling it with function scale.

Note that we chose this simple example for demonstration. Centering/scaling the data is also possible with makePreprocWrapperCaret.

Specifying the train function

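The train function has to be a function with the arguments data (a data.frame with all columns, including the target variable), target (the name of the target variable in data) and args (a list of further arguments and parameters that influence the preprocessing).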

It must return a list with elements $data and $control, where $data is the preprocessed data set and $control stores all information required to preprocess the data before prediction.
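As a minimal sketch of this contract, consider a train function that drops constant features and records their names in $control. The function dropConstTrain and the toy data are hypothetical and serve only to show the expected shape of the return value.

```r
## Hypothetical train function: remove constant features and remember
## which ones were removed so that prediction can do the same
dropConstTrain = function(data, target, args) {
  feats = setdiff(colnames(data), target)
  is.const = vapply(data[feats], function(col) length(unique(col)) <= 1,
    logical(1))
  keep = c(feats[!is.const], target)
  list(
    data = data[, keep, drop = FALSE],         # preprocessed training data
    control = list(dropped = feats[is.const])  # needed again at prediction
  )
}

## Toy data: x2 is constant, y is the target
d = data.frame(x1 = c(1, 2, 3), x2 = c(7, 7, 7), y = c(0.1, 0.2, 0.3))
out = dropConstTrain(d, target = "y", args = list())
str(out)
```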

The train function for the scaling example is given below. It calls scale on the numerical features and returns the scaled training data and the corresponding scaling parameters.

args contains the center and scale arguments of function scale and slot $control stores the scaling parameters to be used in the prediction stage.

Regarding the latter, note that the center and scale arguments of scale can each be either a logical value or a numeric vector whose length equals the number of numeric columns in data. If a logical value was passed via args, we store the column means in the $center slot and the column standard deviations (when center is TRUE) or root mean squares (otherwise) in the $scale slot of the returned $control object.

trainfun = function(data, target, args = list(center, scale)) {
  ## Identify numerical features
  cns = colnames(data)
  nums = setdiff(cns[sapply(data, is.numeric)], target)
  ## Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = args$center, scale = args$scale)
  ## Store the scaling parameters in control
  ## These are needed to preprocess the data before prediction
  control = args
  if (is.logical(control$center) && control$center)
    control$center = attr(x, "scaled:center")
  if (is.logical(control$scale) && control$scale)
    control$scale = attr(x, "scaled:scale")
  ## Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(list(data = data, control = control))
}
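Which scaling parameters scale stores can be checked directly in base R. The sketch below uses a tiny made-up matrix and compares the "scaled:scale" attribute with the standard deviation (centered case) and with the root mean square as defined on the help page of scale (uncentered case).

```r
## Tiny made-up example with a single column
x = matrix(c(1, 2, 3, 4), ncol = 1)
n = nrow(x)

## With centering, the stored scale is the standard deviation
s.centered = attr(scale(x, center = TRUE, scale = TRUE), "scaled:scale")
c(stored = s.centered, sd = sd(x))

## Without centering, it is the root mean square sqrt(sum(x^2) / (n - 1))
s.uncentered = attr(scale(x, center = FALSE, scale = TRUE), "scaled:scale")
c(stored = s.uncentered, rms = sqrt(sum(x^2) / (n - 1)))
```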

Specifying the predict function

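The predict function has the arguments data (the data to be preprocessed before prediction), target (the name of the target variable), args (the same list of arguments that was passed to the train function) and control (the object returned by the train function in $control).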

It returns the preprocessed data.

In our scaling example the predict function scales the numerical features using the parameters from the training stage stored in control.

predictfun = function(data, target, args, control) {
  ## Identify numerical features
  cns = colnames(data)
  nums = cns[sapply(data, is.numeric)]
  ## Extract numerical features from the data set and call scale
  x = as.matrix(data[, nums, drop = FALSE])
  x = scale(x, center = control$center, scale = control$scale)
  ## Recombine the data
  data = data[, setdiff(cns, nums), drop = FALSE]
  data = cbind(data, as.data.frame(x))
  return(data)
}
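The principle shared by both functions, scaling new data with the parameters learned on the training data, can be verified in base R on a small made-up example (the two data frames below are assumptions for illustration).

```r
## Made-up training and test data
train = data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
test = data.frame(a = c(5, 6), b = c(50, 60))

## Training stage: scale and keep the parameters
x.train = scale(as.matrix(train), center = TRUE, scale = TRUE)
center = attr(x.train, "scaled:center")
scale.par = attr(x.train, "scaled:scale")

## Prediction stage: reuse the training parameters on the test data
x.test = scale(as.matrix(test), center = center, scale = scale.par)
x.test
```

The test columns are deliberately not centered at zero afterwards, because their own means and standard deviations are never used.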

Creating the preprocessing wrapper

Below we create a preprocessing wrapper with a regression neural network (which itself does not have a scaling option) as base learner.

The train and predict functions defined above are passed to makePreprocWrapper via the train and predict arguments. par.vals is a list of parameter values that is relayed to the args argument of the train function.

lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun,
  par.vals = list(center = TRUE, scale = TRUE))
lrn
#> Learner regr.nnet.preproc from package nnet
#> Type: regr
#> Name: ; Short name: 
#> Class: PreprocWrapper
#> Properties: numerics,factors,weights
#> Predict-Type: response
#> Hyperparameters: size=3,trace=FALSE,decay=0.01

Let's compare the cross-validated mean squared error (mse) on the Boston Housing data set with and without scaling.

rdesc = makeResampleDesc("CV", iters = 3)

r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r
#> Resample Result
#> Task: BostonHousing-example
#> Learner: regr.nnet.preproc
#> Aggr perf: mse.test.mean=20.3
#> Runtime: 0.234115

lrn = makeLearner("regr.nnet", trace = FALSE, decay = 1e-02)
r = resample(lrn, bh.task, resampling = rdesc, show.info = FALSE)
r
#> Resample Result
#> Task: BostonHousing-example
#> Learner: regr.nnet
#> Aggr perf: mse.test.mean=55.1
#> Runtime: 0.184568

Joint tuning of preprocessing and learner parameters

Often it's not clear which preprocessing options work best with a certain learning algorithm. As already shown for the number of principal components in makePreprocWrapperCaret, we can easily tune them together with other hyperparameters of the learner.

In our scaling example we can check whether nnet works best with both centering and scaling the data, or whether it's better to omit one of the two operations or to do no preprocessing at all. In order to tune center and scale we have to add appropriate LearnerParams to the parameter set of the wrapped learner.

As mentioned above scale allows for numeric and logical center and scale arguments. As we want to use the latter option we declare center and scale as logical learner parameters.

lrn = makeLearner("regr.nnet", trace = FALSE)
lrn = makePreprocWrapper(lrn, train = trainfun, predict = predictfun,
  par.set = makeParamSet(
    makeLogicalLearnerParam("center"),
    makeLogicalLearnerParam("scale")
  ),
  par.vals = list(center = TRUE, scale = TRUE))

lrn
#> Learner regr.nnet.preproc from package nnet
#> Type: regr
#> Name: ; Short name: 
#> Class: PreprocWrapper
#> Properties: numerics,factors,weights
#> Predict-Type: response
#> Hyperparameters: size=3,trace=FALSE,center=TRUE,scale=TRUE

getParamSet(lrn)
#>             Type len    Def      Constr Req Tunable Trafo
#> center   logical   -      -           -   -    TRUE     -
#> scale    logical   -      -           -   -    TRUE     -
#> size     integer   -      3    0 to Inf   -    TRUE     -
#> maxit    integer   -    100    1 to Inf   -    TRUE     -
#> linout   logical   -  FALSE           -   Y    TRUE     -
#> entropy  logical   -  FALSE           -   Y    TRUE     -
#> softmax  logical   -  FALSE           -   Y    TRUE     -
#> censored logical   -  FALSE           -   Y    TRUE     -
#> skip     logical   -  FALSE           -   -    TRUE     -
#> rang     numeric   -    0.7 -Inf to Inf   -    TRUE     -
#> decay    numeric   -      0    0 to Inf   -    TRUE     -
#> Hess     logical   -  FALSE           -   -    TRUE     -
#> trace    logical   -   TRUE           -   -   FALSE     -
#> MaxNWts  integer   -   1000    1 to Inf   -    TRUE     -
#> abstol   numeric   - 0.0001 -Inf to Inf   -    TRUE     -
#> reltol   numeric   -  1e-08 -Inf to Inf   -    TRUE     -

Now we do a simple grid search for the decay parameter of nnet and the center and scale parameters.

rdesc = makeResampleDesc("Holdout")
ps = makeParamSet(
  makeDiscreteParam("decay", c(0, 0.05, 0.1)),
  makeLogicalParam("center"),
  makeLogicalParam("scale")
)
ctrl = makeTuneControlGrid()
res = tuneParams(lrn, bh.task, rdesc, par.set = ps, control = ctrl, show.info = FALSE)

res
#> Tune result:
#> Op. pars: decay=0.05; center=FALSE; scale=TRUE
#> mse.test.mean=14.8

as.data.frame(res$opt.path)
#>    decay center scale mse.test.mean dob eol error.message exec.time
#> 1      0   TRUE  TRUE      49.38128   1  NA          <NA>     0.101
#> 2   0.05   TRUE  TRUE      24.33826   2  NA          <NA>     0.115
#> 3    0.1   TRUE  TRUE      22.61593   3  NA          <NA>     0.081
#> 4      0  FALSE  TRUE      96.25474   4  NA          <NA>     0.041
#> 5   0.05  FALSE  TRUE      14.84306   5  NA          <NA>     0.092
#> 6    0.1  FALSE  TRUE      15.31225   6  NA          <NA>     0.089
#> 7      0   TRUE FALSE      40.51518   7  NA          <NA>     0.089
#> 8   0.05   TRUE FALSE      68.00069   8  NA          <NA>     0.088
#> 9    0.1   TRUE FALSE      55.42210   9  NA          <NA>     0.084
#> 10     0  FALSE FALSE      96.25474  10  NA          <NA>     0.065
#> 11  0.05  FALSE FALSE      56.25758  11  NA          <NA>     0.099
#> 12   0.1  FALSE FALSE      49.66780  12  NA          <NA>     0.091
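The 12 rows of the optimization path correspond to the full grid of 3 decay values times 2 values each for center and scale. A base R sketch that enumerates the same combinations in the same order as the table above:

```r
## Full grid: 3 * 2 * 2 = 12 parameter settings
grid = expand.grid(
  decay = c(0, 0.05, 0.1),
  center = c(TRUE, FALSE),
  scale = c(TRUE, FALSE)
)
nrow(grid)

## Row 5 is the setting reported as optimal in the tuning result
grid[5, ]
```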

Preprocessing wrapper functions

If you have written a preprocessing wrapper that you might want to use from time to time, it's a good idea to encapsulate it in its own function as shown below. If you think your preprocessing method is something others might want to use as well and should be integrated into mlr, just contact us.

makePreprocWrapperScale = function(learner, center = TRUE, scale = TRUE) {
  trainfun = function(data, target, args = list(center, scale)) {
    cns = colnames(data)
    nums = setdiff(cns[sapply(data, is.numeric)], target)
    x = as.matrix(data[, nums, drop = FALSE])
    x = scale(x, center = args$center, scale = args$scale)
    control = args
    if (is.logical(control$center) && control$center)
      control$center = attr(x, "scaled:center")
    if (is.logical(control$scale) && control$scale)
      control$scale = attr(x, "scaled:scale")
    data = data[, setdiff(cns, nums), drop = FALSE]
    data = cbind(data, as.data.frame(x))
    return(list(data = data, control = control))
  }
  predictfun = function(data, target, args, control) {
    cns = colnames(data)
    nums = cns[sapply(data, is.numeric)]
    x = as.matrix(data[, nums, drop = FALSE])
    x = scale(x, center = control$center, scale = control$scale)
    data = data[, setdiff(cns, nums), drop = FALSE]
    data = cbind(data, as.data.frame(x))
    return(data)
  }
  makePreprocWrapper(
    learner,
    train = trainfun,
    predict = predictfun,
    par.set = makeParamSet(
      makeLogicalLearnerParam("center"),
      makeLogicalLearnerParam("scale")
    ),
    par.vals = list(center = center, scale = scale)
  )
}

lrn = makePreprocWrapperScale("classif.lda")
train(lrn, iris.task)
#> Model for learner.id=classif.lda.preproc; learner.class=PreprocWrapper
#> Trained on: task.id = iris-example; obs = 150; features = 4
#> Hyperparameters: center=TRUE,scale=TRUE