Benchmark Experiments

In a benchmark experiment, different learning methods are applied to one or several data sets with the aim of comparing and ranking the algorithms with respect to one or more performance measures.

In mlr, a benchmark experiment can be conducted by calling the function benchmark on a list of Learners and a list of Tasks. benchmark essentially executes resample for each combination of Learner and Task. You can specify an individual resampling strategy for each Task and select one or multiple performance measures to be calculated.
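
To make the "resample for each combination" idea concrete, here is a rough sketch of what such a loop could look like when written by hand. It is not how benchmark is implemented internally; the object names (lrns.demo, tasks.demo, rin.demo, rows, res.demo) are illustrative and do not appear elsewhere in this tutorial.

```r
library(mlr)
set.seed(1)

# Illustrative objects (assumed names, not part of the tutorial's examples)
lrns.demo = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
tasks.demo = list(iris.task, sonar.task)

rows = list()
for (task in tasks.demo) {
  # One resample instance per task, shared by all learners on that task
  rin.demo = makeResampleInstance(makeResampleDesc("Holdout"), task = task)
  for (lrn in lrns.demo) {
    r = resample(lrn, task, rin.demo, measures = mmce, show.info = FALSE)
    rows[[length(rows) + 1]] = data.frame(
      task.id = getTaskId(task),
      learner.id = lrn$id,
      mmce.test.mean = unname(r$aggr["mmce.test.mean"])
    )
  }
}

# One row per (task, learner) pair, similar in spirit to a printed benchmark result
res.demo = do.call(rbind, rows)
res.demo
```

In practice benchmark is preferable, since it also stores the results in a structure that the accessor and plotting functions understand.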

Conducting benchmark experiments

We start with a small example. Two learners, linear discriminant analysis (lda) and a classification tree (rpart), are applied to one classification problem (sonar.task). As resampling strategy we choose "Holdout". The performance is thus calculated on a single randomly sampled test data set.

In the example below we create a resample description (ResampleDesc), which is automatically instantiated by benchmark. The instantiation is done only once per Task, i.e., the same training and test sets are used for all learners. It is also possible to directly pass a ResampleInstance.

If you would like to use a fixed test data set instead of a randomly selected one, you can create a suitable ResampleInstance through function makeFixedHoldoutInstance.

## Two learners to be compared
lrns = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))

## Choose the resampling strategy
rdesc = makeResampleDesc("Holdout")

## Conduct the benchmark experiment
bmr = benchmark(lrns, sonar.task, rdesc)
#> Task: Sonar-example, Learner: classif.lda
#> Resampling: holdout
#> Measures:             mmce
#> [Resample] iter 1:    0.3000000
#> 
#> Aggregated Result: mmce.test.mean=0.3000000
#> 
#> Task: Sonar-example, Learner: classif.rpart
#> Resampling: holdout
#> Measures:             mmce
#> [Resample] iter 1:    0.2857143
#> 
#> Aggregated Result: mmce.test.mean=0.2857143
#> 

bmr
#>         task.id    learner.id mmce.test.mean
#> 1 Sonar-example   classif.lda      0.3000000
#> 2 Sonar-example classif.rpart      0.2857143

For convenience, if you don't want to pass any additional arguments to makeLearner, you don't need to generate the Learners explicitly, but it's sufficient to provide the learner name. In the above example we could also have written:

## Vector of strings
lrns = c("classif.lda", "classif.rpart")

## A mixed list of Learner objects and strings works, too
lrns = list(makeLearner("classif.lda", predict.type = "prob"), "classif.rpart")

bmr = benchmark(lrns, sonar.task, rdesc)
#> Task: Sonar-example, Learner: classif.lda
#> Resampling: holdout
#> Measures:             mmce
#> [Resample] iter 1:    0.2571429
#> 
#> Aggregated Result: mmce.test.mean=0.2571429
#> 
#> Task: Sonar-example, Learner: classif.rpart
#> Resampling: holdout
#> Measures:             mmce
#> [Resample] iter 1:    0.2714286
#> 
#> Aggregated Result: mmce.test.mean=0.2714286
#> 

bmr
#>         task.id    learner.id mmce.test.mean
#> 1 Sonar-example   classif.lda      0.2571429
#> 2 Sonar-example classif.rpart      0.2714286

In the printed summary table every row corresponds to one pair of Task and Learner. The entries show the mean misclassification error (mmce), the default performance measure for classification, on the test data set.

The result bmr is an object of class BenchmarkResult. Basically, it contains a list of lists of ResampleResult objects, first ordered by Task and then by Learner.
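
Returning to the fixed test set mentioned earlier, the following sketch shows one way makeFixedHoldoutInstance might be used with benchmark. The split into odd and even row indices is an arbitrary assumption made for illustration, as are the names train.inds, test.inds, rin.fixed, lrns.fixed and bmr.fixed.

```r
library(mlr)

# Arbitrary illustrative split: odd rows for training, even rows for testing
n = getTaskSize(sonar.task)
train.inds = seq(1, n, by = 2)
test.inds = seq(2, n, by = 2)

# Resample instance with a fixed (non-random) holdout split
rin.fixed = makeFixedHoldoutInstance(train.inds, test.inds, size = n)

lrns.fixed = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
bmr.fixed = benchmark(lrns.fixed, sonar.task, rin.fixed, show.info = FALSE)

# The ids of the predicted observations should all come from test.inds
pred.ids = unique(getBMRPredictions(bmr.fixed, as.df = TRUE)$id)
all(pred.ids %in% test.inds)
```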

Making experiments reproducible

Typically, we would want our experiment results to be reproducible. mlr obeys the set.seed function, so make sure to use set.seed at the beginning of your script if you would like your results to be reproducible.

Note that if you are using parallel computing, you may need to adjust how you call set.seed depending on your use case. One possibility is to use set.seed(123, "L'Ecuyer") to ensure the results are reproducible for each child process (the kind argument is partially matched to "L'Ecuyer-CMRG"). See the examples in mclapply (package parallel) for more information on reproducibility and parallel computing.
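
As a small sequential illustration (a sketch, not a guarantee for every learner or parallel backend), running the same benchmark twice with the same seed should yield the same holdout split and therefore the same performance values. The names lrns.seed, rdesc.seed, bmr.a and bmr.b are illustrative.

```r
library(mlr)

lrns.seed = list(makeLearner("classif.rpart"))
rdesc.seed = makeResampleDesc("Holdout")

# Same seed before each call, so the random holdout split is drawn identically
set.seed(123)
bmr.a = benchmark(lrns.seed, sonar.task, rdesc.seed, show.info = FALSE)
set.seed(123)
bmr.b = benchmark(lrns.seed, sonar.task, rdesc.seed, show.info = FALSE)

perf.a = getBMRAggrPerformances(bmr.a, as.df = TRUE)$mmce.test.mean
perf.b = getBMRAggrPerformances(bmr.b, as.df = TRUE)$mmce.test.mean
all.equal(perf.a, perf.b)
```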

Accessing benchmark results

mlr provides several accessor functions, named getBMR<WhatToExtract>, that let you retrieve information for further analyses. This includes, for example, the performances or predictions of the learning algorithms under consideration.

Learner performances

Let's have a look at the benchmark result above. getBMRPerformances returns individual performances in resampling runs, while getBMRAggrPerformances gives the aggregated values.

getBMRPerformances(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#>   iter      mmce
#> 1    1 0.2571429
#> 
#> $`Sonar-example`$classif.rpart
#>   iter      mmce
#> 1    1 0.2714286

getBMRAggrPerformances(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> mmce.test.mean 
#>      0.2571429 
#> 
#> $`Sonar-example`$classif.rpart
#> mmce.test.mean 
#>      0.2714286

Since we used holdout as resampling strategy, individual and aggregated performance values coincide.
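
With more than one resampling iteration the two accessors differ. The sketch below uses a 3-fold cross-validation (an arbitrary choice for illustration) and checks that the default aggregation, test.mean, is the mean of the per-fold values. The names bmr.cv, perf.cv and aggr.cv are illustrative.

```r
library(mlr)
set.seed(1)

bmr.cv = benchmark(list(makeLearner("classif.rpart")), sonar.task,
  makeResampleDesc("CV", iters = 3), show.info = FALSE)

# One row per fold
perf.cv = getBMRPerformances(bmr.cv, as.df = TRUE)
# One row per (task, learner) pair
aggr.cv = getBMRAggrPerformances(bmr.cv, as.df = TRUE)

# The aggregated value should equal the mean over the folds
all.equal(mean(perf.cv$mmce), aggr.cv$mmce.test.mean)
```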

By default, nearly all "getter" functions return a nested list, with the first level indicating the task and the second level indicating the learner. If only a single learner or, as in our case, a single task is considered, setting drop = TRUE simplifies the result to a flat list.

getBMRPerformances(bmr, drop = TRUE)
#> $classif.lda
#>   iter      mmce
#> 1    1 0.2571429
#> 
#> $classif.rpart
#>   iter      mmce
#> 1    1 0.2714286

Often it is more convenient to work with data.frames. You can easily convert the result structure by setting as.df = TRUE.

getBMRPerformances(bmr, as.df = TRUE)
#>         task.id    learner.id iter      mmce
#> 1 Sonar-example   classif.lda    1 0.2571429
#> 2 Sonar-example classif.rpart    1 0.2714286

getBMRAggrPerformances(bmr, as.df = TRUE)
#>         task.id    learner.id mmce.test.mean
#> 1 Sonar-example   classif.lda      0.2571429
#> 2 Sonar-example classif.rpart      0.2714286

Predictions

By default, the BenchmarkResult contains the learner predictions. If you do not want to keep them, e.g., to conserve memory, set keep.pred = FALSE when calling benchmark.

You can access the predictions using function getBMRPredictions. By default, you get a nested list of ResamplePrediction objects. As above, you can use the drop or as.df options to simplify the result.

getBMRPredictions(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: prob
#> threshold: M=0.50,R=0.50
#> time (mean): 0.01
#>    id truth     prob.M      prob.R response iter  set
#> 1 127     M 0.98673000 0.013270001        M    1 test
#> 2 159     M 0.99659179 0.003408211        M    1 test
#> 3  81     R 0.55436799 0.445632009        M    1 test
#> 4 207     M 0.98660766 0.013392337        M    1 test
#> 5  74     R 0.94120073 0.058799272        M    1 test
#> 6 154     M 0.03862365 0.961376347        R    1 test
#> ... (#rows: 70, #cols: 7)
#> 
#> $`Sonar-example`$classif.rpart
#> Resampled Prediction for:
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE
#> predict.type: response
#> threshold: 
#> time (mean): 0.01
#>    id truth response iter  set
#> 1 127     M        R    1 test
#> 2 159     M        R    1 test
#> 3  81     R        R    1 test
#> 4 207     M        M    1 test
#> 5  74     R        R    1 test
#> 6 154     M        M    1 test
#> ... (#rows: 70, #cols: 5)

head(getBMRPredictions(bmr, as.df = TRUE))
#>         task.id  learner.id  id truth     prob.M      prob.R response iter
#> 1 Sonar-example classif.lda 127     M 0.98673000 0.013270001        M    1
#> 2 Sonar-example classif.lda 159     M 0.99659179 0.003408211        M    1
#> 3 Sonar-example classif.lda  81     R 0.55436799 0.445632009        M    1
#> 4 Sonar-example classif.lda 207     M 0.98660766 0.013392337        M    1
#> 5 Sonar-example classif.lda  74     R 0.94120073 0.058799272        M    1
#> 6 Sonar-example classif.lda 154     M 0.03862365 0.961376347        R    1
#>    set
#> 1 test
#> 2 test
#> 3 test
#> 4 test
#> 5 test
#> 6 test
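
The following sketch illustrates the keep.pred option. It inspects the stored ResampleResult objects through the results element of the BenchmarkResult; treat that direct access as an assumption about the object layout rather than a recommended interface. The names lrns.mem, rdesc.mem, bmr.full and bmr.slim are illustrative.

```r
library(mlr)
set.seed(1)

lrns.mem = list(makeLearner("classif.lda"), makeLearner("classif.rpart"))
rdesc.mem = makeResampleDesc("Holdout")

bmr.full = benchmark(lrns.mem, sonar.task, rdesc.mem, keep.pred = TRUE, show.info = FALSE)
bmr.slim = benchmark(lrns.mem, sonar.task, rdesc.mem, keep.pred = FALSE, show.info = FALSE)

# Assumed layout: results[[task id]][[learner id]] is a ResampleResult with a pred slot
has.pred.full = !is.null(bmr.full$results[["Sonar-example"]][["classif.lda"]]$pred)
has.pred.slim = !is.null(bmr.slim$results[["Sonar-example"]][["classif.lda"]]$pred)
c(full = has.pred.full, slim = has.pred.slim)

# Performance values remain available without the predictions
getBMRAggrPerformances(bmr.slim, as.df = TRUE)
```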

You can also access results for certain learners or tasks via their IDs. For this purpose, many "getter" functions have a learner.ids and a task.ids argument.

head(getBMRPredictions(bmr, learner.ids = "classif.rpart", as.df = TRUE))
#>         task.id    learner.id  id truth response iter  set
#> 1 Sonar-example classif.rpart 127     M        R    1 test
#> 2 Sonar-example classif.rpart 159     M        R    1 test
#> 3 Sonar-example classif.rpart  81     R        R    1 test
#> 4 Sonar-example classif.rpart 207     M        M    1 test
#> 5 Sonar-example classif.rpart  74     R        R    1 test
#> 6 Sonar-example classif.rpart 154     M        M    1 test

If you don't like the default IDs, you can set the IDs of learners and tasks via the id option of makeLearner and make*Task. Moreover, you can conveniently change the ID of a Learner via function setLearnerId.
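
A short sketch of both ways to set learner IDs, and of using the new IDs for subsetting. The IDs "tree" and "lda.short" and the object names are made up for illustration.

```r
library(mlr)
set.seed(1)

# ID set at construction time
lrn.tree = makeLearner("classif.rpart", id = "tree")
# ID changed afterwards
lrn.lda = setLearnerId(makeLearner("classif.lda"), "lda.short")

bmr.ids = benchmark(list(lrn.tree, lrn.lda), sonar.task,
  makeResampleDesc("Holdout"), show.info = FALSE)

getBMRLearnerIds(bmr.ids)

# Subsetting now uses the custom ID
pred.tree = getBMRPredictions(bmr.ids, learner.ids = "tree", as.df = TRUE)
head(pred.tree)
```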

IDs

The IDs of all Learners, Tasks and Measures in a benchmark experiment can be retrieved as follows:

getBMRTaskIds(bmr)
#> [1] "Sonar-example"

getBMRLearnerIds(bmr)
#> [1] "classif.lda"   "classif.rpart"

getBMRMeasureIds(bmr)
#> [1] "mmce"

Fitted models

By default, the BenchmarkResult also contains the fitted models for all learners on all tasks. If you do not want to keep them, set models = FALSE when calling benchmark. The fitted models can be retrieved by function getBMRModels. It returns a (possibly nested) list of WrappedModel objects.

getBMRModels(bmr)
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> $`Sonar-example`$classif.lda[[1]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters: 
#> 
#> 
#> $`Sonar-example`$classif.rpart
#> $`Sonar-example`$classif.rpart[[1]]
#> Model for learner.id=classif.rpart; learner.class=classif.rpart
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters: xval=0

getBMRModels(bmr, drop = TRUE)
#> $classif.lda
#> $classif.lda[[1]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters: 
#> 
#> 
#> $classif.rpart
#> $classif.rpart[[1]]
#> Model for learner.id=classif.rpart; learner.class=classif.rpart
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters: xval=0

getBMRModels(bmr, learner.ids = "classif.lda")
#> $`Sonar-example`
#> $`Sonar-example`$classif.lda
#> $`Sonar-example`$classif.lda[[1]]
#> Model for learner.id=classif.lda; learner.class=classif.lda
#> Trained on: task.id = Sonar-example; obs = 138; features = 60
#> Hyperparameters:
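
To get from a WrappedModel to the underlying fit of the original package, getLearnerModel can be applied to a list element. The sketch below passes models = TRUE explicitly so that it does not depend on the default; bmr.mod, wrapped and fit are illustrative names.

```r
library(mlr)
set.seed(1)

bmr.mod = benchmark(list(makeLearner("classif.rpart")), sonar.task,
  makeResampleDesc("Holdout"), models = TRUE, show.info = FALSE)

# Nested list: task id, then learner id, then resampling iteration
wrapped = getBMRModels(bmr.mod)[["Sonar-example"]][["classif.rpart"]][[1]]

# The underlying rpart object, usable with the rpart package's own tools
fit = getLearnerModel(wrapped)
class(fit)
```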

Learners and measures

Moreover, you can extract the employed Learners and Measures.

getBMRLearners(bmr)
#> $classif.lda
#> Learner classif.lda from package MASS
#> Type: classif
#> Name: Linear Discriminant Analysis; Short name: lda
#> Class: classif.lda
#> Properties: twoclass,multiclass,numerics,factors,prob
#> Predict-Type: prob
#> Hyperparameters: 
#> 
#> 
#> $classif.rpart
#> Learner classif.rpart from package rpart
#> Type: classif
#> Name: Decision Tree; Short name: rpart
#> Class: classif.rpart
#> Properties: twoclass,multiclass,missings,numerics,factors,ordered,prob,weights,featimp
#> Predict-Type: response
#> Hyperparameters: xval=0

getBMRMeasures(bmr)
#> [[1]]
#> Name: Mean misclassification error
#> Performance measure: mmce
#> Properties: classif,classif.multi,req.pred,req.truth
#> Minimize: TRUE
#> Best: 0; Worst: 1
#> Aggregated by: test.mean
#> Arguments: 
#> Note: Defined as: mean(response != truth)

Merging benchmark results

Sometimes, after completing a benchmark experiment, it turns out that you want to extend it by another Learner or another Task. In this case you can perform an additional benchmark experiment and then use function mergeBenchmarkResults to combine the results into a single BenchmarkResult object that can be accessed and analyzed as usual.

For example, in the benchmark experiment above we applied lda and rpart to the sonar.task. We now perform a second experiment using a random forest and quadratic discriminant analysis (qda) and merge the results.

## First benchmark result
bmr
#>         task.id    learner.id mmce.test.mean
#> 1 Sonar-example   classif.lda      0.2571429
#> 2 Sonar-example classif.rpart      0.2714286

## Benchmark experiment for the additional learners
lrns2 = list(makeLearner("classif.randomForest"), makeLearner("classif.qda"))
bmr2 = benchmark(lrns2, sonar.task, rdesc, show.info = FALSE)
bmr2
#>         task.id           learner.id mmce.test.mean
#> 1 Sonar-example classif.randomForest      0.1428571
#> 2 Sonar-example          classif.qda      0.2714286

## Merge the results
mergeBenchmarkResults(list(bmr, bmr2))
#>         task.id           learner.id mmce.test.mean
#> 1 Sonar-example          classif.lda      0.2571429
#> 2 Sonar-example        classif.rpart      0.2714286
#> 3 Sonar-example classif.randomForest      0.1428571
#> 4 Sonar-example          classif.qda      0.2714286

Note that in the above examples in each case a resample description was passed to the benchmark function. For this reason lda and rpart were most likely evaluated on a different training/test set pair than random forest and qda.

Differing training/test set pairs across learners introduce an additional source of variation in the results, which can make it harder to detect actual performance differences between learners. Therefore, if you suspect that you will have to extend your benchmark experiment by another Learner later on, it is probably easiest to work with ResampleInstances from the start. These can be stored and used for any additional experiments.
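
A minimal sketch of that workflow: the instance is created once with makeResampleInstance and reused for a later experiment. The learners chosen and the names rin.shared, bmr.first and bmr.later are illustrative.

```r
library(mlr)
set.seed(1)

# Create the train/test split once and keep the object around
rin.shared = makeResampleInstance(makeResampleDesc("Holdout"), task = sonar.task)

bmr.first = benchmark(list(makeLearner("classif.lda")), sonar.task,
  rin.shared, show.info = FALSE)

# Later extension: another learner evaluated on the identical split
bmr.later = benchmark(list(makeLearner("classif.rpart")), sonar.task,
  rin.shared, show.info = FALSE)

# Both experiments predicted the same test observations
ids.first = sort(getBMRPredictions(bmr.first, as.df = TRUE)$id)
ids.later = sort(getBMRPredictions(bmr.later, as.df = TRUE)$id)
identical(ids.first, ids.later)

bmr.merged = mergeBenchmarkResults(list(bmr.first, bmr.later))
bmr.merged
```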

Alternatively, if you used a resample description in the first benchmark experiment you could also extract the ResampleInstances from the BenchmarkResult bmr and pass these to all further benchmark calls.

rin = getBMRPredictions(bmr)[[1]][[1]]$instance
rin
#> Resample instance for 208 cases.
#> Resample description: holdout with 0.67 split rate.
#> Predict: test
#> Stratification: FALSE

## Benchmark experiment for the additional learners, using the extracted instance
bmr3 = benchmark(lrns2, sonar.task, rin, show.info = FALSE)
bmr3
#>         task.id           learner.id mmce.test.mean
#> 1 Sonar-example classif.randomForest      0.2000000
#> 2 Sonar-example          classif.qda      0.5142857

## Merge the results
mergeBenchmarkResults(list(bmr, bmr3))
#>         task.id           learner.id mmce.test.mean
#> 1 Sonar-example          classif.lda      0.2571429
#> 2 Sonar-example        classif.rpart      0.2714286
#> 3 Sonar-example classif.randomForest      0.2000000
#> 4 Sonar-example          classif.qda      0.5142857

Benchmark analysis and visualization

mlr offers several ways to analyze the results of a benchmark experiment. This includes visualization, ranking of learning algorithms and hypothesis tests to assess performance differences between learners.

In order to demonstrate the functionality we conduct a slightly larger benchmark experiment with three learning algorithms that are applied to five classification tasks.

Example: Comparing lda, rpart and random forest

We consider linear discriminant analysis (lda), classification trees (rpart), and random forests (randomForest). Since the default learner IDs are a little long, we choose shorter names in the R code below.

We use five classification tasks. Three are already provided by mlr, two more data sets are taken from package mlbench and converted to Tasks by function convertMLBenchObjToTask.

For all tasks 10-fold cross-validation is chosen as resampling strategy. This is achieved by passing a single resample description to benchmark, which is then instantiated automatically once for each Task. This way, the same instance is used for all learners applied to a single task.

It is also possible to choose a different resampling strategy for each Task by passing a list of the same length as the number of tasks that can contain both resample descriptions and resample instances.
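
As a sketch of the per-task variant, the list below mixes a resample description for the first task with a resample instance for the second. The particular strategies are arbitrary choices for illustration, and tasks.mixed, rsmpls.mixed and bmr.mixed are illustrative names.

```r
library(mlr)
set.seed(1)

tasks.mixed = list(iris.task, sonar.task)

# One entry per task, in the same order as tasks.mixed
rsmpls.mixed = list(
  makeResampleDesc("CV", iters = 3),
  makeResampleInstance(makeResampleDesc("Holdout"), task = sonar.task)
)

bmr.mixed = benchmark(list(makeLearner("classif.rpart")), tasks.mixed,
  rsmpls.mixed, show.info = FALSE)

# Three rows for the cross-validated task plus one row for the holdout task
perf.mixed = getBMRPerformances(bmr.mixed, as.df = TRUE)
table(perf.mixed$task.id)
```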

We use the mean misclassification error mmce as primary performance measure, but also calculate the balanced error rate (ber) and the training time (timetrain).

## Create a list of learners
lrns = list(
  makeLearner("classif.lda", id = "lda"),
  makeLearner("classif.rpart", id = "rpart"),
  makeLearner("classif.randomForest", id = "randomForest")
)

## Get additional Tasks from package mlbench
ring.task = convertMLBenchObjToTask("mlbench.ringnorm", n = 600)
wave.task = convertMLBenchObjToTask("mlbench.waveform", n = 600)

tasks = list(iris.task, sonar.task, pid.task, ring.task, wave.task)
rdesc = makeResampleDesc("CV", iters = 10)
meas = list(mmce, ber, timetrain)
bmr = benchmark(lrns, tasks, rdesc, meas, show.info = FALSE)
bmr
#>                        task.id   learner.id mmce.test.mean ber.test.mean
#> 1                 iris-example          lda     0.02000000    0.02222222
#> 2                 iris-example        rpart     0.08000000    0.07555556
#> 3                 iris-example randomForest     0.05333333    0.05250000
#> 4             mlbench.ringnorm          lda     0.35000000    0.34605671
#> 5             mlbench.ringnorm        rpart     0.17333333    0.17313632
#> 6             mlbench.ringnorm randomForest     0.05833333    0.05806121
#> 7             mlbench.waveform          lda     0.19000000    0.18257244
#> 8             mlbench.waveform        rpart     0.28833333    0.28765247
#> 9             mlbench.waveform randomForest     0.16500000    0.16306057
#> 10 PimaIndiansDiabetes-example          lda     0.22778537    0.27148893
#> 11 PimaIndiansDiabetes-example        rpart     0.25133288    0.28967870
#> 12 PimaIndiansDiabetes-example randomForest     0.23685919    0.27543146
#> 13               Sonar-example          lda     0.24619048    0.23986694
#> 14               Sonar-example        rpart     0.30785714    0.31153361
#> 15               Sonar-example randomForest     0.17785714    0.17442696
#>    timetrain.test.mean
#> 1               0.0040
#> 2               0.0050
#> 3               0.0270
#> 4               0.0100
#> 5               0.0127
#> 6               0.4045
#> 7               0.0105
#> 8               0.0144
#> 9               0.4434
#> 10              0.0059
#> 11              0.0080
#> 12              0.3648
#> 13              0.0189
#> 14              0.0186
#> 15              0.2615

From the aggregated performance values we can see that for the iris- and PimaIndiansDiabetes-example linear discriminant analysis performs well while for all other tasks the random forest seems superior. Training takes longer for the random forest than for the other learners.

In order to draw any conclusions from the average performances, at least their variability has to be taken into account or, preferably, the distribution of performance values across resampling iterations.

The individual performances on the 10 folds for every task, learner, and measure are retrieved below.

perf = getBMRPerformances(bmr, as.df = TRUE)
head(perf)
#>        task.id learner.id iter      mmce       ber timetrain
#> 1 iris-example        lda    1 0.0000000 0.0000000     0.003
#> 2 iris-example        lda    2 0.1333333 0.1666667     0.003
#> 3 iris-example        lda    3 0.0000000 0.0000000     0.004
#> 4 iris-example        lda    4 0.0000000 0.0000000     0.004
#> 5 iris-example        lda    5 0.0000000 0.0000000     0.004
#> 6 iris-example        lda    6 0.0000000 0.0000000     0.004
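
Once the per-fold values are in a data.frame, their spread can be summarized with base R. The sketch below runs a smaller, separate experiment (two learners, two tasks, 5 folds, all chosen for brevity) and computes mean and standard deviation per task and learner. The names bmr.var, perf.var and spread are illustrative.

```r
library(mlr)
set.seed(42)

bmr.var = benchmark(
  list(makeLearner("classif.lda", id = "lda"), makeLearner("classif.rpart", id = "rpart")),
  list(iris.task, sonar.task),
  makeResampleDesc("CV", iters = 5),
  measures = list(mmce),
  show.info = FALSE
)
perf.var = getBMRPerformances(bmr.var, as.df = TRUE)

# Mean and standard deviation of mmce over the folds, per task and learner
spread = aggregate(mmce ~ task.id + learner.id, data = perf.var,
  FUN = function(x) c(mean = mean(x), sd = sd(x)))
spread
```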

A closer look at the aggregated results reveals that the random forest outperforms the classification tree on every task, while linear discriminant analysis performs better than rpart on most of them. Additionally, lda sometimes even beats the random forest. As benchmark experiments grow, such tables become hard to read and comprehend.

mlr features some plotting functions to visualize results of benchmark experiments that you might find useful. Moreover, mlr offers statistical hypothesis tests to assess performance differences between learners.

Integrated plots

Plots are generated using ggplot2. Further customization, such as renaming plot elements or changing colors, is easily possible.

Visualizing performances

plotBMRBoxplots creates box or violin plots which show the distribution of performance values across resampling iterations for one performance measure and for all learners and tasks (and thus visualize the output of getBMRPerformances).

Below are both variants, box and violin plots. The first plot shows the mmce and the second plot the balanced error rate (ber). Moreover, in the second plot we color the boxes according to the learner.ids.

plotBMRBoxplots(bmr, measure = mmce)


plotBMRBoxplots(bmr, measure = ber, style = "violin", pretty.names = FALSE) +
  aes(color = learner.id) +
  theme(strip.text.x = element_text(size = 8))


Note that by default the measure names and the learner short.names are used as axis labels.

mmce$name
#> [1] "Mean misclassification error"

mmce$id
#> [1] "mmce"

getBMRLearnerIds(bmr)
#> [1] "lda"          "rpart"        "randomForest"

getBMRLearnerShortNames(bmr)
#> [1] "lda"   "rpart" "rf"

If you prefer the ids, e.g., mmce and ber, set pretty.names = FALSE (as done for the second plot). Of course, you can also use ggplot2 functionality such as the ylab function to choose completely different labels.

One question that comes up quite often is how to change the panel headers (which default to the Task IDs) and the learner names on the x-axis. For example, looking at the above plots, we would like to remove the "example" suffixes and the "mlbench" prefixes from the panel headers. Moreover, we want uppercase learner labels. Currently, probably the simplest solution is to change the factor levels of the plotted data as shown below.

plt = plotBMRBoxplots(bmr, measure = mmce)
head(plt$data)
#>        task.id learner.id iter      mmce       ber timetrain
#> 1 iris-example        lda    1 0.0000000 0.0000000     0.003
#> 2 iris-example        lda    2 0.1333333 0.1666667     0.003
#> 3 iris-example        lda    3 0.0000000 0.0000000     0.004
#> 4 iris-example        lda    4 0.0000000 0.0000000     0.004
#> 5 iris-example        lda    5 0.0000000 0.0000000     0.004
#> 6 iris-example        lda    6 0.0000000 0.0000000     0.004

levels(plt$data$task.id) = c("Iris", "Ringnorm", "Waveform", "Diabetes", "Sonar")
levels(plt$data$learner.id) = c("LDA", "CART", "RF")

plt + ylab("Error rate")
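
Assigning new levels by position relies on knowing the current level order. A slightly more defensive variant, sketched here on a plain factor so that it runs without mlr, maps old names to new ones explicitly. The vector new.names and the factor task.id are illustrative stand-ins for plt$data$task.id.

```r
# Stand-in for a task.id column; values taken from the task IDs used above
task.id = factor(c("iris-example", "Sonar-example", "mlbench.ringnorm", "iris-example"))

# Explicit old-name -> new-name mapping (illustrative)
new.names = c(
  "iris-example" = "Iris",
  "mlbench.ringnorm" = "Ringnorm",
  "Sonar-example" = "Sonar"
)

# Look up each current level by name, so the result does not depend on level order
levels(task.id) = unname(new.names[levels(task.id)])
as.character(task.id)
```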


Visualizing aggregated performances

The aggregated performance values (resulting from getBMRAggrPerformances) can be visualized by function plotBMRSummary. This plot draws one line for each task on which the aggregated values of one performance measure for all learners are displayed. By default, the first measure in the list of Measures passed to benchmark is used, in our example mmce. Moreover, a small vertical jitter is added to prevent overplotting.

plotBMRSummary(bmr)


Calculating and visualizing ranks

In addition to absolute performance, relative performance, i.e., the ranking of the learners, is usually of interest and might provide valuable additional insight.

Function convertBMRToRankMatrix calculates ranks based on aggregated learner performances of one measure. We choose the mean misclassification error (mmce). The rank structure can be visualized by plotBMRRanksAsBarChart.

m = convertBMRToRankMatrix(bmr, mmce)
m
#>              iris-example mlbench.ringnorm mlbench.waveform
#> lda                     1                3                2
#> rpart                   3                2                3
#> randomForest            2                1                1
#>              PimaIndiansDiabetes-example Sonar-example
#> lda                                    1             2
#> rpart                                  3             3
#> randomForest                           2             1

Methods with best performance, i.e., with lowest mmce, are assigned the lowest rank. Linear discriminant analysis is best for the iris and PimaIndiansDiabetes-examples while the random forest shows best results on the remaining tasks.
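
Since the result is an ordinary matrix, simple summaries such as the average rank per learner can be computed with base R. The sketch below rebuilds the matrix from the values printed above so that it runs without mlr; the names m.ranks and avg.rank are illustrative.

```r
# Rank matrix copied from the printed output above (filled column by column)
m.ranks = matrix(
  c(1, 3, 2,   # iris-example
    3, 2, 1,   # mlbench.ringnorm
    2, 3, 1,   # mlbench.waveform
    1, 3, 2,   # PimaIndiansDiabetes-example
    2, 3, 1),  # Sonar-example
  nrow = 3,
  dimnames = list(
    c("lda", "rpart", "randomForest"),
    c("iris-example", "mlbench.ringnorm", "mlbench.waveform",
      "PimaIndiansDiabetes-example", "Sonar-example")
  )
)

# Average rank across the five tasks, best (lowest) first
avg.rank = sort(rowMeans(m.ranks))
avg.rank
#> randomForest          lda        rpart 
#>          1.4          1.8          2.8
```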

plotBMRRanksAsBarChart with option pos = "tile" shows a corresponding heat map. The ranks are displayed on the x-axis and the learners are color-coded.

plotBMRRanksAsBarChart(bmr, pos = "tile")


A similar plot can also be obtained via plotBMRSummary. With option trafo = "rank" the ranks are displayed instead of the aggregated performances.

plotBMRSummary(bmr, trafo = "rank", jitter = 0)


Alternatively, you can draw stacked bar charts (the default) or bar charts with juxtaposed bars (pos = "dodge"), which are better suited for comparing the frequencies of learners within and across ranks.

plotBMRRanksAsBarChart(bmr)
plotBMRRanksAsBarChart(bmr, pos = "dodge")
