Integrating Another Learner
In order to integrate a learning algorithm into mlr some interface code has to be written. Three functions are mandatory for each learner.
 First, define a new learner class with a name, description, capabilities, parameters, and a few other things. (An object of this class can then be generated by makeLearner.)
 Second, you need to provide a function that calls the learner function and builds the model given data (which makes it possible to invoke training by calling mlr's train function).
 Finally, a prediction function that returns predicted values given new data is required (which enables invoking prediction by calling mlr's predict function).
Technically, integrating a learning method means introducing a new S3 class and implementing the corresponding methods for the generic functions makeRLerner, trainLearner, and predictLearner. Therefore we start with a quick overview of the involved classes and constructor functions.
Classes, constructors, and naming schemes
As you already know makeLearner generates an object of class Learner.
class(makeLearner(cl = "classif.lda"))
#> [1] "classif.lda" "RLearnerClassif" "RLearner" "Learner"
class(makeLearner(cl = "regr.lm"))
#> [1] "regr.lm" "RLearnerRegr" "RLearner" "Learner"
class(makeLearner(cl = "surv.coxph"))
#> [1] "surv.coxph" "RLearnerSurv" "RLearner" "Learner"
class(makeLearner(cl = "cluster.kmeans"))
#> [1] "cluster.kmeans" "RLearnerCluster" "RLearner" "Learner"
class(makeLearner(cl = "multilabel.rFerns"))
#> [1] "multilabel.rFerns" "RLearnerMultilabel" "RLearner"
#> [4] "Learner"
The first element of each class attribute vector is the name of the learner
class passed to the cl
argument of makeLearner.
Obviously, this adheres to the naming conventions
"classif.<R_method_name>"
for classification,"multilabel.<R_method_name>"
for multilabel classification,"regr.<R_method_name>"
for regression,"surv.<R_method_name>"
for survival analysis, and"cluster.<R_method_name>"
for clustering.
Additionally, there exist intermediate classes that reflect the type of learning problem, i.e., all classification learners inherit from RLearnerClassif, all regression learners from RLearnerRegr and so on. Their superclasses are RLearner and finally Learner. For all these (sub)classes there exist constructor functions makeRLearner, makeRLearnerClassif, makeRLearneRegr etc. that are called internally by makeLearner.
A short side remark: As you might have noticed there does not exist a special learner class for costsensitive classification (costsens) with examplespecific costs. This type of learning task is currently exclusively handled through wrappers like makeCostSensWeightedPairsWrapper.
In the following we show how to integrate learners for the five types of learning tasks mentioned above. Defining a completely new type of learner that has special properties and does not fit into one of the existing schemes is of course possible, but much more advanced and not covered here.
We use a classification example to explain some general principles (so even if you are interested in integrating a learner for another type of learning task you might want to read the following section). Examples for other types of learning tasks are shown later on.
Classification
We show how the Linear Discriminant Analysis from
package MASS has been integrated
into the classification learner classif.lda
in mlr as an example.
Definition of the learner
The minimal information required to define a learner is the mlr name of the learner, its package, the parameter set, and the set of properties of your learner. In addition, you may provide a humanreadable name, a short name and a note with information relevant to users of the learner.
First, name your learner. According to the naming conventions above the name starts with
classif.
and we choose classif.lda
.
Second, we need to define the parameters of the learner. These are any options that can be set when running it to change how it learns, how input is interpreted, how and what output is generated, and so on. mlr provides a number of functions to define parameters, a complete list can be found in the documentation of LearnerParam of the ParamHelpers package.
In our example, we have discrete and numeric parameters, so we use makeDiscreteLearnerParam and makeNumericLearnerParam to incorporate the complete description of the parameters. We include all possible values for discrete parameters and lower and upper bounds for numeric parameters. Strictly speaking it is not necessary to provide bounds for all parameters and if this information is not available they can be estimated, but providing accurate and specific information here makes it possible to tune the learner much better (see the section on tuning).
Next, we add information on the properties of the learner (see also the section
on learners). Which types of features are supported (numerics,
factors)? Are case weights supported? Are class weights supported? Can the method deal
with missing values in the features and deal with NA's in a meaningful way (not na.omit
)?
Are oneclass, twoclass, multiclass problems supported? Can the learner predict
posterior probabilities?
If the learner supports class weights the name of the relevant learner parameter
can be specified via argument class.weights.param
.
Below is the complete code for the definition of the LDA learner. It has one
discrete parameter, method
, and two continuous ones, nu
and tol
. It
supports classification problems with two or more classes and can deal with
numeric and factor explanatory variables. It can predict posterior
probabilities.
makeRLearner.classif.lda = function() {
makeRLearnerClassif(
cl = "classif.lda",
package = "MASS",
par.set = makeParamSet(
makeDiscreteLearnerParam(id = "method", default = "moment", values = c("moment", "mle", "mve", "t")),
makeNumericLearnerParam(id = "nu", lower = 2, requires = quote(method == "t")),
makeNumericLearnerParam(id = "tol", default = 1e4, lower = 0),
makeDiscreteLearnerParam(id = "predict.method", values = c("plugin", "predictive", "debiased"),
default = "plugin", when = "predict"),
makeLogicalLearnerParam(id = "CV", default = FALSE, tunable = FALSE)
),
properties = c("twoclass", "multiclass", "numerics", "factors", "prob"),
name = "Linear Discriminant Analysis",
short.name = "lda",
note = "Learner param 'predict.method' maps to 'method' in predict.lda."
)
}
Creating the training function of the learner
Once the learner has been defined, we need to tell mlr how to call it to
train a model. The name of the function has to start with trainLearner.
,
followed by the mlr name of the learner as defined above (classif.lda
here). The prototype of the function looks as follows.
function(.learner, .task, .subset, .weights = NULL, ...) { }
This function must fit a model on the data of the task .task
with regard to
the subset defined in the integer vector .subset
and the parameters passed
in the ...
arguments. Usually, the data should be extracted from the task
using getTaskData. This will take care of any subsetting as well. It must
return the fitted model. mlr assumes no special data type for the return
value  it will be passed to the predict function we are going to define below,
so any special code the learner may need can be encapsulated there.
For our example, the definition of the function looks like this. In addition to the data of the task, we also need the formula that describes what to predict. We use the function getTaskFormula to extract this from the task.
trainLearner.classif.lda = function(.learner, .task, .subset, .weights = NULL, ...) {
f = getTaskFormula(.task)
MASS::lda(f, data = getTaskData(.task, .subset), ...)
}
Creating the prediction method
Finally, the prediction function needs to be defined. The name of this function
starts with predictLearner.
, followed again by the mlr name of the
learner. The prototype of the function is as follows.
function(.learner, .model, .newdata, ...) { }
It must predict for the new observations in the data.frame
.newdata
with
the wrapped model .model
, which is returned from the training function.
The actual model the learner built is stored in the $learner.model
member
and can be accessed simply through .model$learner.model
.
For classification, you have to return a factor of predicted classes if
.learner$predict.type
is "response"
, or a matrix of predicted
probabilities if .learner$predict.type
is "prob"
and this type of
prediction is supported by the learner. In the latter case the matrix must have
the same number of columns as there are classes in the task and the columns have
to be named by the class names.
The definition for LDA looks like this. It is pretty much just a straight passthrough of the arguments to the predict function and some extraction of prediction data depending on the type of prediction requested.
predictLearner.classif.lda = function(.learner, .model, .newdata, predict.method = "plugin", ...) {
p = predict(.model$learner.model, newdata = .newdata, method = predict.method, ...)
if (.learner$predict.type == "response")
return(p$class) else return(p$posterior)
}
Regression
The main difference for regression is that the type of predictions are different (numeric instead of labels or probabilities) and that not all of the properties are relevant. In particular, whether one, two, or multiclass problems and posterior probabilities are supported is not applicable.
Apart from this, everything explained above applies. Below is the definition for the earth learner from the earth package.
makeRLearner.regr.earth = function() {
makeRLearnerRegr(
cl = "regr.earth",
package = "earth",
par.set = makeParamSet(
makeLogicalLearnerParam(id = "keepxy", default = FALSE, tunable = FALSE),
makeNumericLearnerParam(id = "trace", default = 0, upper = 10, tunable = FALSE),
makeIntegerLearnerParam(id = "degree", default = 1L, lower = 1L),
makeNumericLearnerParam(id = "penalty"),
makeIntegerLearnerParam(id = "nk", lower = 0L),
makeNumericLearnerParam(id = "thres", default = 0.001),
makeIntegerLearnerParam(id = "minspan", default = 0L),
makeIntegerLearnerParam(id = "endspan", default = 0L),
makeNumericLearnerParam(id = "newvar.penalty", default = 0),
makeIntegerLearnerParam(id = "fast.k", default = 20L, lower = 0L),
makeNumericLearnerParam(id = "fast.beta", default = 1),
makeDiscreteLearnerParam(id = "pmethod", default = "backward",
values = c("backward", "none", "exhaustive", "forward", "seqrep", "cv")),
makeIntegerLearnerParam(id = "nprune")
),
properties = c("numerics", "factors"),
name = "Multivariate Adaptive Regression Splines",
short.name = "earth",
note = ""
)
}
trainLearner.regr.earth = function(.learner, .task, .subset, .weights = NULL, ...) {
f = getTaskFormula(.task)
earth::earth(f, data = getTaskData(.task, .subset), ...)
}
predictLearner.regr.earth = function(.learner, .model, .newdata, ...) {
predict(.model$learner.model, newdata = .newdata)[, 1L]
}
Again most of the data is passed straight through to/from the train/predict functions of the learner.
Survival analysis
For survival analysis, you have to return socalled linear predictors in order to compute
the default measure for this task type, the cindex (for
.learner$predict.type
== "response"
). For .learner$predict.type
== "prob"
,
there is no substantially meaningful measure (yet). You may either ignore this case or return
something like predicted survival curves (cf. example below).
There are three properties that are specific to survival learners: "rcens", "lcens" and "icens", defining the type(s) of censoring a learner can handle  right, left and/or interval censored.
Let's have a look at how the Cox Proportional Hazard Model from
package survival has been integrated
into the survival learner surv.coxph
in mlr as an example:
makeRLearner.surv.coxph = function() {
makeRLearnerSurv(
cl = "surv.coxph",
package = "survival",
par.set = makeParamSet(
makeDiscreteLearnerParam(id = "ties", default = "efron", values = c("efron", "breslow", "exact")),
makeLogicalLearnerParam(id = "singular.ok", default = TRUE),
makeNumericLearnerParam(id = "eps", default = 1e09, lower = 0),
makeNumericLearnerParam(id = "toler.chol", default = .Machine$double.eps^0.75, lower = 0),
makeIntegerLearnerParam(id = "iter.max", default = 20L, lower = 1L),
makeNumericLearnerParam(id = "toler.inf", default = sqrt(.Machine$double.eps^0.75), lower = 0),
makeIntegerLearnerParam(id = "outer.max", default = 10L, lower = 1L),
makeLogicalLearnerParam(id = "model", default = FALSE, tunable = FALSE),
makeLogicalLearnerParam(id = "x", default = FALSE, tunable = FALSE),
makeLogicalLearnerParam(id = "y", default = TRUE, tunable = FALSE)
),
properties = c("missings", "numerics", "factors", "weights", "prob", "rcens"),
name = "Cox Proportional Hazard Model",
short.name = "coxph",
note = ""
)
}
trainLearner.surv.coxph = function(.learner, .task, .subset, .weights = NULL, ...) {
f = getTaskFormula(.task)
data = getTaskData(.task, subset = .subset)
if (is.null(.weights)) {
survival::coxph(formula = f, data = data, ...)
} else {
survival::coxph(formula = f, data = data, weights = .weights, ...)
}
}
predictLearner.surv.coxph = function(.learner, .model, .newdata, ...) {
predict(.model$learner.model, newdata = .newdata, type = "lp", ...)
}
Clustering
For clustering, you have to return a numeric vector with the IDs of the clusters that the respective datum has been assigned to. The numbering should start at 1.
Below is the definition for the FarthestFirst learner
from the RWeka package. Weka
starts the IDs of the clusters at 0, so we add 1 to the predicted clusters.
RWeka has a different way of setting learner parameters; we use the special
Weka_control
function to do this.
makeRLearner.cluster.FarthestFirst = function() {
makeRLearnerCluster(
cl = "cluster.FarthestFirst",
package = "RWeka",
par.set = makeParamSet(
makeIntegerLearnerParam(id = "N", default = 2L, lower = 1L),
makeIntegerLearnerParam(id = "S", default = 1L, lower = 1L),
makeLogicalLearnerParam(id = "outputdebuginfo", default = FALSE, tunable = FALSE)
),
properties = c("numerics"),
name = "FarthestFirst Clustering Algorithm",
short.name = "farthestfirst"
)
}
trainLearner.cluster.FarthestFirst = function(.learner, .task, .subset, .weights = NULL, ...) {
ctrl = RWeka::Weka_control(...)
RWeka::FarthestFirst(getTaskData(.task, .subset), control = ctrl)
}
predictLearner.cluster.FarthestFirst = function(.learner, .model, .newdata, ...) {
as.integer(predict(.model$learner.model, .newdata, ...)) + 1L
}
Multilabel classification
As stated in the multilabel section, multilabel classification methods can be divided into problem transformation methods and algorithm adaptation methods.
At this moment the only problem transformation method implemented in mlr is the binary relevance method. Integrating more of these methods requires good knowledge of the architecture of the mlr package.
The integration of an algorithm adaptation multilabel classification learner is easier and
works very similar to the normal multiclassclassification.
In contrast to the multiclass case, not all of the learner properties are relevant.
In particular, whether one, two, or multiclass problems are supported is not applicable.
Furthermore the prediction function output must be a matrix
with each prediction of a label in one column and the names of the labels
as column names. If .learner$predict.type
is "response"
the predictions must
be logical. If .learner$predict.type
is "prob"
and this type of
prediction is supported by the learner, the matrix must consist of predicted
probabilities.
Below is the definition of the rFerns learner from the rFerns package, which does not support probability predictions.
makeRLearner.multilabel.rFerns = function() {
makeRLearnerMultilabel(
cl = "multilabel.rFerns",
package = "rFerns",
par.set = makeParamSet(
makeIntegerLearnerParam(id = "depth", default = 5L),
makeIntegerLearnerParam(id = "ferns", default = 1000L)
),
properties = c("numerics", "factors", "ordered"),
name = "Random ferns",
short.name = "rFerns",
note = ""
)
}
trainLearner.multilabel.rFerns = function(.learner, .task, .subset, .weights = NULL, ...) {
d = getTaskData(.task, .subset, target.extra = TRUE)
rFerns::rFerns(x = d$data, y = as.matrix(d$target), ...)
}
predictLearner.multilabel.rFerns = function(.learner, .model, .newdata, ...) {
as.matrix(predict(.model$learner.model, .newdata, ...))
}
Creating a new method for extracting feature importance values
Some learners, for example decision trees and random forests, can calculate feature importance values, which can be extracted from a fitted model using function getFeatureImportance.
If your newly integrated learner supports this you need to
 add
"featimp"
to the learner properties and  implement a new S3 method for function getFeatureImportanceLearner (which later is called internally by getFeatureImportance)
in order to make this work.
This method takes the Learner .learner
, the WrappedModel .model
and potential further arguments and calculates or extracts the feature importance.
It must return a named vector of importance values.
Below are two simple examples.
In case of "classif.rpart"
the feature importance values can be easily extracted from the
fitted model.
getFeatureImportanceLearner.classif.rpart = function(.learner, .model, ...) {
mod = getLearnerModel(.model, more.unwrap = TRUE)
mod$variable.importance
}
For the random forest from package randomForestSRC function vimp is called.
getFeatureImportanceLearner.classif.randomForestSRC = function(.learner, .model, ...) {
mod = getLearnerModel(.model, more.unwrap = TRUE)
randomForestSRC::vimp(mod, ...)$importance[, "all"]
}
Creating a new method for extracting outofbag predictions
Many ensemble learners generate outofbag predictions (OOB predictions) automatically. mlr provides the function getOOBPreds to access these predictions in the mlr framework.
If your newly integrated learner is able to calculate OOB predictions and you want to be able to access them in mlr via getOOBPreds you need to
 add
"oobpreds"
to the learner properties and  implement a new S3 method for function getOOBPredsLearner (which later is called internally by getOOBPreds).
This method takes the Learner .learner
and the WrappedModel
.model
and extracts the OOB predictions.
It must return the predictions in the same format as the predictLearner function.
getOOBPredsLearner.classif.randomForest = function(.learner, .model) {
if (.learner$predict.type == "response") {
m = getLearnerModel(.model, more.unwrap = TRUE)
unname(m$predicted)
} else {
getLearnerModel(.model, more.unwrap = TRUE)$votes
}
}
Registering your learner
If your interface code to a new learning algorithm exists only locally, i.e., it is not (yet) merged into mlr or does not live in an extra package with a proper namespace you might want to register the new S3 methods to make sure that these are found by, e.g., listLearners. You can do this as follows:
registerS3method("makeRLearner", "<awesome_new_learner_class>", makeRLearner.<awesome_new_learner_class>)
registerS3method("trainLearner", "<awesome_new_learner_class>", trainLearner.<awesome_new_learner_class>)
registerS3method("predictLearner", "<awesome_new_learner_class>", predictLearner.<awesome_new_learner_class>)
If you have written more methods, for example in order to extract feature importance values or outofbag predictions these also need to be registered in the same manner, for example:
registerS3method("getFeatureImportanceLearner", "<awesome_new_learner_class>",
getFeatureImportanceLearner.<awesome_new_learner_class>)
For the new learner to work with parallelization, you may have to export the new methods explicitly:
parallelExport("trainLearner.<awesome_new_learner_class>", "predictLearner.<awesome_new_learner_class>")
Further information for developers
If you haven't written a learner interface for private use only, but intend to send a pull request to have it included in the mlr package there are a few things to take care of, most importantly unit testing!
For general information about contributing to the package, unit testing, version control setup and the like please also read the coding guidelines in the mlr Wiki.

The R file containing the interface code should adhere to the naming convention
RLearner_<type>_<learner_name>.R
, e.g.,RLearner_classif_lda.R
, see for example https://github.com/mlrorg/mlr/blob/master/R/RLearner_classif_lda.R and contain the necessary roxygen@export
tags to register the S3 methods in the NAMESPACE. 
The learner interfaces should work out of the box without requiring any parameters to be set, e.g.,
train("classif.lda", iris.task)
should run. Sometimes, this makes it necessary to change or set some additional defaults as explained above and  very important  informing the user about this in thenote
. 
The parameter set of the learner should be as complete as possible.

Every learner interface must be unit tested.
Unit testing
The tests make sure that we get the same results when the learner is invoked through the mlr interface and when using the original functions. If you are not familiar or want to learn more about unit testing and package testthat have a look at the Testing chapter in Hadley Wickham's R packages.
In mlr all unit tests are in the following directory:
https://github.com/mlrorg/mlr/tree/master/tests/testthat.
For each learner interface there is an individual file whose name follows the scheme
test_<type>_<learner_name>.R
, for example
https://github.com/mlrorg/mlr/blob/master/tests/testthat/test_classif_lda.R.
Below is a snippet from the tests of the lda interface https://github.com/mlrorg/mlr/blob/master/tests/testthat/test_classif_lda.R.
test_that("classif_lda", {
requirePackagesOrSkip("MASS", default.method = "load")
set.seed(getOption("mlr.debug.seed"))
m = MASS::lda(formula = multiclass.formula, data = multiclass.train)
set.seed(getOption("mlr.debug.seed"))
p = predict(m, newdata = multiclass.test)
testSimple("classif.lda", multiclass.df, multiclass.target, multiclass.train.inds, p$class)
testProb("classif.lda", multiclass.df, multiclass.target, multiclass.train.inds, p$posterior)
})
The tests make use of numerous helper objects and helper functions.
All of these are defined in the helper_
files in
https://github.com/mlrorg/mlr/blob/master/tests/testthat/.
In the above code the first line just loads package MASS or skips the test if the package is
not available.
The objects multiclass.formula
, multiclass.train
, multiclass.test
etc. are defined
in https://github.com/mlrorg/mlr/blob/master/tests/testthat/helper_objects.R.
We tried to choose fairly selfexplanatory names:
For example multiclass
indicates a multiclass classification problem, multiclass.train
contains data for training, multiclass.formula
a formula object etc.
The test fits an lda model on the training set and makes predictions on the test set
using the original functions lda and predict.lda.
The helper functions testSimple
and testProb
perform training and prediction on the same
data using the mlr interface  testSimple
for predict.type = "response
and testProbs
for
predict.type = "prob"
 and check if the predicted class labels and probabilities coincide
with the outcomes p$class
and p$posterior
.
In order to get reproducible results seeding is required for many learners.
The "mlr.debug.seed"
works as follows:
When invoking the tests the option "mlr.debug.seed"
is set
(see https://github.com/mlrorg/mlr/blob/master/tests/testthat/helper_zzz.R), and
set.seed(getOption("mlr.debug.seed"))
is used to specify the seed.
Internally, mlr's
train and
predict.WrappedModel
functions check if the "mlr.debug.seed"
option is set and if yes, also specify the seed.
Note that the option "mlr.debug.seed"
is only set for testing, so no seeding happens in
normal usage of mlr.
Let's look at a second example. Many learners have parameters that are commonly changed or tuned and it is important to make sure that these are passed through correctly. Below is a snippet from https://github.com/mlrorg/mlr/blob/master/tests/testthat/test_regr_randomForest.R.
test_that("regr_randomForest", {
requirePackagesOrSkip("randomForest", default.method = "load")
parset.list = list(
list(),
list(ntree = 5, mtry = 2),
list(ntree = 5, mtry = 4),
list(proximity = TRUE, oob.prox = TRUE),
list(nPerm = 3)
)
old.predicts.list = list()
for (i in 1:length(parset.list)) {
parset = parset.list[[i]]
pars = list(formula = regr.formula, data = regr.train)
pars = c(pars, parset)
set.seed(getOption("mlr.debug.seed"))
m = do.call(randomForest::randomForest, pars)
set.seed(getOption("mlr.debug.seed"))
p = predict(m, newdata = regr.test, type = "response")
old.predicts.list[[i]] = p
}
testSimpleParsets("regr.randomForest", regr.df, regr.target,
regr.train.inds, old.predicts.list, parset.list)
})
All tested parameter configurations are collected in the parset.list
.
In order to make sure that the default parameter configuration is tested the first element
of the parset.list
is an empty list.
Then we simply loop over all parameter settings and store the resulting predictions in
old.predicts.list
.
Again the helper function testSimpleParsets
does the same using the mlr interface
and compares the outcomes.
Additional to tests for individual learners we also have general tests that loop through all
integrated learners and make for example sure that learners have the correct properties
(e.g. a learner with property "factors"
can cope with factor features,
a learner with property "weights"
takes observation weights into account properly etc.).
For example https://github.com/mlrorg/mlr/blob/master/tests/testthat/test_learners_all_classif.R
runs through all classification learners.
Similar tests exist for all types of learning methods like regression, cluster and survival analysis
as well as multilabel classification.
In order to run all tests for, e.g., classification learners on your machine you can invoke the tests from within R by
devtools::test("mlr", filter = "classif")
or from the command line using Michel's rt tool
rtest filter=classif