# Parallel benchmarking with OpenML and mlr

In this post I want to show you how to benchmark several learners (or learners with different parameter settings) on several data sets in a structured and parallelized fashion. For this we will use the batchtools package.

The data that we will use here is stored on the open machine learning platform openml.org, and we can download it, together with information on what to do with it, in the form of a task.
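As a quick illustration, a single task can be fetched with the OpenML package like this (the task ID below is purely illustrative, not one used in this post):

```r
library(OpenML)

# A task bundles the data set with the prediction target and the
# resampling strategy to use. The ID here is an arbitrary example;
# replace it with any valid OpenML task ID.
task = getOMLTask(task.id = 31)
task
```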

If you have a small project and don’t need to parallelize, you might want to just look at the previous blog post called mlr loves OpenML.

The following packages are needed for this:
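Given the tools discussed in this post, the setup presumably looks like:

```r
# Packages used throughout this post
library(mlr)        # learners and machine learning infrastructure
library(OpenML)     # downloading tasks and uploading runs
library(batchtools) # structuring and parallelizing the experiments
```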

In the next step we need to create the so-called registry. This basically creates a folder with a certain subfolder structure.
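A minimal sketch of creating the registry (the folder name is taken from this post; the `packages` argument makes those packages available on the computational nodes, and the seed is an arbitrary choice):

```r
library(batchtools)

# Create the registry folder with its subfolder structure. Packages
# listed here are loaded on each node when jobs are executed.
reg = makeExperimentRegistry(
  file.dir = "parallel_benchmarking_blogpost",
  packages = c("mlr", "OpenML", "party"),
  seed     = 123
)
```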

Now you should have a new folder in your working directory with the name parallel_benchmarking_blogpost and the following subfolders / files:

parallel_benchmarking_blogpost/
├── algorithms
├── exports
├── external
├── jobs
├── logs
├── problems
├── registry.rds
├── results


Now we get to the interesting part. We need to define…

• the algorithm, which with mlr and OpenML is quite simple to define using makeLearner and runTaskMlr. We do not have to save the run results (the result of applying the learner to the task); instead we can directly upload them to OpenML, where they are automatically evaluated.
• the machine learning experiment, i.e. in our case which parameters we want to set for which learner. As an example, we will look at the ctree algorithm from the party package and check whether Bonferroni correction (a correction for multiple testing) helps to get better predictions, and also whether we need a tree with more than two leaf nodes (stump = FALSE) or whether a small tree is enough (stump = TRUE).
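The two steps above can be sketched with batchtools' addProblem / addAlgorithm / addExperiments. The task IDs and the exact parameter grid below are illustrative assumptions, not necessarily the ones used in the original experiment:

```r
# Problems: each OpenML task becomes one batchtools problem.
# The task IDs here are hypothetical examples.
for (id in c(3, 31)) {
  addProblem(name = paste0("omltask_", id), data = getOMLTask(task.id = id))
}

# Algorithm: build a ctree learner with the given parameters, run it on
# the task and upload the run, so OpenML evaluates it for us. The
# function returns the server-side run ID.
addAlgorithm(name = "ctree", fun = function(job, data, instance, testtype, stump) {
  lrn = makeLearner("classif.ctree", testtype = testtype, stump = stump)
  run = runTaskMlr(data, lrn)
  uploadOMLRun(run)
})

# Experiments: cross all parameter combinations with all problems.
pars = expand.grid(
  testtype = c("Bonferroni", "Univariate"),
  stump    = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)
addExperiments(algo.designs = list(ctree = pars))

submitJobs()
```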
While your jobs are running, you can check the progress using getStatus(). As soon as getStatus() tells us that all our runs are done, we can collect the results of our experiment from OpenML. To be able to do this, we need to collect the run IDs of the runs we uploaded during the experiment. We also want to add the information on the parameter settings used (getJobPars()).
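A sketch of this collection step, assuming each job returned the ID of the run it uploaded:

```r
# Block until all jobs are finished, then print a status summary.
waitForJobs()
getStatus()

# Gather the OpenML run IDs returned by the jobs, together with the
# parameter settings each job used.
run.ids = unlist(reduceResultsList())
pars    = getJobPars()

# Fetch the evaluations OpenML computed for our uploaded runs.
evals = listOMLRunEvaluations(run.id = run.ids)
```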