First we need to install the `cranlogs` package using `devtools`:
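A sketch of the installation step (at the time of writing, `cranlogs` lived on GitHub under the metacran organization; adjust the repository name if it has moved):

```r
# install cranlogs from GitHub via devtools (repository location is an assumption)
devtools::install_github("metacran/cranlogs")
```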

Now let’s load all the packages we will need:

To obtain a neat table of all available learners in *mlr* we can call `listLearners()`.
This table also contains a column with the packages needed for each learner, separated by a `,`.
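For instance (a sketch, assuming *mlr* is loaded; the exact set of columns may differ between versions):

```r
library(mlr)
# list all learners integrated in mlr together with the packages they require
lrns = listLearners()
head(lrns[, c("class", "name", "package")])
```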

*Note:* You might get some warnings here because you likely did not install all packages that *mlr* suggests – which is totally fine.

Now we can obtain the download counts from the *RStudio CRAN mirror* for the last month.
We use `data.table` to easily sum up the download counts of each day.
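A sketch of this step (the package selection here is only illustrative):

```r
library(cranlogs)
library(data.table)

# daily downloads from the RStudio CRAN mirror over the last month
dl = as.data.table(cran_downloads(packages = c("survival", "e1071", "MASS"),
  when = "last-month"))
# sum the daily counts to one number per package
dl[, .(downloads = sum(count)), by = package]
```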

As some learners need multiple packages, we will use the download count of the package with the fewest downloads.

Let’s put these numbers in our table:

*Here are the first 5 rows of the table:*

class | name | package | downloads |
---|---|---|---|
surv.coxph | Cox Proportional Hazard Model | survival | 153681 |
classif.naiveBayes | Naive Bayes | e1071 | 102249 |
classif.svm | Support Vector Machines (libsvm) | e1071 | 102249 |
regr.svm | Support Vector Machines (libsvm) | e1071 | 102249 |
classif.lda | Linear Discriminant Analysis | MASS | 55852 |

Now let’s get rid of the duplicates introduced by the distinction between the types *classif*, *regr* etc., and we already have our…

The top 20 according to the *RStudio CRAN mirror*:

class | name | package | downloads |
---|---|---|---|
surv.coxph | Cox Proportional Hazard Model | survival | 153681 |
classif.naiveBayes | Naive Bayes | e1071 | 102249 |
classif.svm | Support Vector Machines (libsvm) | e1071 | 102249 |
classif.lda | Linear Discriminant Analysis | MASS | 55852 |
classif.qda | Quadratic Discriminant Analysis | MASS | 55852 |
classif.randomForest | Random Forest | randomForest | 52094 |
classif.gausspr | Gaussian Processes | kernlab | 44812 |
classif.ksvm | Support Vector Machines | kernlab | 44812 |
classif.lssvm | Least Squares Support Vector Machine | kernlab | 44812 |
cluster.kkmeans | Kernel K-Means | kernlab | 44812 |
regr.rvm | Relevance Vector Machine | kernlab | 44812 |
classif.cvglmnet | GLM with Lasso or Elasticnet Regularization (Cross Validated Lambda) | glmnet | 41179 |
classif.glmnet | GLM with Lasso or Elasticnet Regularization | glmnet | 41179 |
surv.cvglmnet | GLM with Regularization (Cross Validated Lambda) | glmnet | 41179 |
surv.glmnet | GLM with Regularization | glmnet | 41179 |
classif.cforest | Random forest based on conditional inference trees | party | 36492 |
classif.ctree | Conditional Inference Trees | party | 36492 |
regr.cforest | Random Forest Based on Conditional Inference Trees | party | 36492 |
regr.mob | Model-based Recursive Partitioning Yielding a Tree with Fitted Models Associated with each Terminal Node | party,modeltools | 36492 |
surv.cforest | Random Forest based on Conditional Inference Trees | party,survival | 36492 |

As we are just looking for the packages let’s compress the table a bit further and come to our…

*Here are the first 20 rows of the table:*

package | downloads | learners |
---|---|---|
survival | 153681 | surv.coxph |
e1071 | 102249 | classif.naiveBayes,classif.svm,regr.svm |
MASS | 55852 | classif.lda,classif.qda |
randomForest | 52094 | classif.randomForest,regr.randomForest |
kernlab | 44812 | classif.gausspr,classif.ksvm,classif.lssvm,cluster.kkmeans,regr.gausspr,regr.ksvm,regr.rvm |
glmnet | 41179 | classif.cvglmnet,classif.glmnet,regr.cvglmnet,regr.glmnet,surv.cvglmnet,surv.glmnet |
party | 36492 | classif.cforest,classif.ctree,multilabel.cforest,regr.cforest,regr.ctree |
party,modeltools | 36492 | regr.mob |
party,survival | 36492 | surv.cforest |
fpc | 33664 | cluster.dbscan |
rpart | 28609 | classif.rpart,regr.rpart,surv.rpart |
RWeka | 20583 | classif.IBk,classif.J48,classif.JRip,classif.OneR,classif.PART,cluster.Cobweb,cluster.EM,cluster.FarthestFirst,cluster.SimpleKMeans,cluster.XMeans,regr.IBk |
gbm | 19554 | classif.gbm,regr.gbm,surv.gbm |
nnet | 19538 | classif.multinom,classif.nnet,regr.nnet |
caret,pls | 18106 | classif.plsdaCaret |
pls | 18106 | regr.pcr,regr.plsr |
FNN | 16107 | classif.fnn,regr.fnn |
earth | 15824 | regr.earth |
neuralnet | 15506 | classif.neuralnet |
class | 14493 | classif.knn,classif.lvq1 |

And of course we want to have a small visualization:

This is not really representative of how popular each learner is, as some packages serve multiple purposes (e.g. contain multiple learners).
Furthermore, it would be great to have access to the trending list.
Also, *most stars on GitHub* gives a better view of what developers are interested in.
Looking for machine learning packages we see there, e.g., xgboost, h2o and tensorflow.

First, let me introduce you to multilabel classification. This is a classification problem, where every instance can have more than one label. Let’s have a look at a typical multilabel dataset (which I, of course, download from the OpenML server):

Here I took the *scene* dataset, where the features represent color information of pictures and the targets could be objects like *beach*, *sunset*, and so on.

As you can see above, one defining property of a multilabel dataset is that the target variables (which are called *labels*) are binary. If you want to use your own data set, make sure to encode these variables as *logical*, where *TRUE* indicates the relevance of a label.

The basic idea behind many multilabel classification algorithms is to make use of possible correlation between labels. Maybe a learner is very good at predicting label 1, but rather bad at predicting label 2. If label 1 and label 2 are highly correlated, it may be beneficial to predict label 1 first and use this prediction as a feature for predicting label 2.

This approach is the main concept behind the so-called *problem transformation methods*. The multilabel problem is transformed into several binary classification problems, one for each label. Predicted labels are then used as features for predicting other labels.

We implemented the following problem transformation methods:

- Classifier chains
- Nested stacking
- Dependent binary relevance
- Stacking

How these methods are defined can be read in the mlr tutorial or, in more detail, in our paper. Enough theory now, let’s apply these methods to our dataset.

First we need to create a multilabel task.

We set a seed, because the classifier chain wrapper uses a random chain order. Next, we train a learner. I chose the classifier chain approach together with a decision tree for the binary classification problems.

Now let’s train and predict on our dataset:
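A sketch of these steps (the label names follow the *scene* dataset, which is assumed to be loaded already; the train/test split is arbitrary):

```r
library(mlr)

# create the multilabel task; the label columns are logical
labels = c("Beach", "Sunset", "FallFoliage", "Field", "Mountain", "Urban")
scene.task = makeMultilabelTask(data = scene, target = labels)

set.seed(1)  # the classifier chain wrapper picks a random chain order
binary.learner = makeLearner("classif.rpart")
lrncc = makeMultilabelClassifierChainsWrapper(binary.learner)

# train on one part of the data, predict on the rest
mod = train(lrncc, scene.task, subset = 1:1500)
pred = predict(mod, task = scene.task, subset = 1501:2407)
```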

We also implemented common multilabel performance measures. Here is a list with available multilabel performance measures:

Here is how the classifier chains method performed:
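Continuing with a prediction object `pred` as above, the measures can be evaluated like this (a sketch):

```r
# multilabel performance measures shipped with mlr
performance(pred, measures = list(multilabel.hamloss, multilabel.subset01,
  multilabel.f1, multilabel.acc))
```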

Now let’s see if it can be beneficial to use predicted labels as features for other labels. Let us compare the performance of the classifier chains method with the binary relevance method (this method does not use predicted labels as features).

As can be seen here, it can indeed make sense to use more elaborate methods for multilabel classification, since classifier chains beat the binary relevance method in all of these measures (note that Hamming loss and subset01 are loss measures!).

Here I’ll show you how to use resampling methods in the multilabel setting. Resampling methods are key for assessing the performance of a learning algorithm. To read more about resampling, see the respective page of our tutorial.

First, we need to define a resampling strategy. I chose subsampling, which is also called Monte-Carlo cross-validation. The dataset is split into training and test set at a predefined ratio. The learner is trained on the training set, the performance is evaluated with the test set. This whole process is repeated many times and the performance values are averaged. In mlr this is done the following way:

Now we can choose a measure to be evaluated during resampling. All that is left to do is to run the resampling:
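A sketch, assuming the task and the classifier chains learner from before (10 subsampling iterations with a 2/3 split are an arbitrary choice):

```r
# subsampling (Monte-Carlo cross-validation)
rdesc = makeResampleDesc("Subsample", iters = 10, split = 2/3)
r = resample(lrncc, scene.task, resampling = rdesc,
  measures = list(multilabel.hamloss, multilabel.subset01))
r$aggr  # averaged performance values
```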

If you followed the mlr tutorial or if you are already familiar with mlr, you most likely noticed that using resampling in the multilabel setting isn’t any different from using resampling in mlr generally. Many methods available in mlr, like preprocessing, tuning or benchmark experiments, can also be used for multilabel datasets, and the good thing is: the syntax stays the same!

You can vote for your favorite logo on GitHub by reacting to the logo with a +1.

Thanks to Hannah Atkin for designing the logos!

We recently published **mlrMBO** on CRAN.
Like any R package it normally operates inside of R, but in this post I want to demonstrate how **mlrMBO** can be used to optimize an external application.
At the same time I will highlight some issues you are likely to run into.

First of all we need a bash script that we want to optimize.
This tutorial will only run on Unix systems (Linux, OSX etc.) but should also be informative for Windows users.
The following code will write a tiny bash script that uses `bc` to calculate $\sin(x_1-1) + (x_1^2 + x_2^2)$ and write the result “hidden” in a sentence (`The result is 12.34!`) to a `result.txt` text file.
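The generated script might look like the following sketch (file names are assumptions; `bc -l` loads the math library, whose `s()` is the sine function):

```shell
# write the tiny bash script described above
cat > fun.sh <<'EOF'
#!/bin/bash
# x1 and x2 are passed as command line arguments
RES=$(echo "s($1 - 1) + ($1^2 + $2^2)" | bc -l)
echo "The result is $RES!" > result.txt
EOF
chmod +x fun.sh
```

Calling `./fun.sh 1 0` would then hide the computed number in a sentence inside `result.txt`.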

Now we need an R function that starts the script, reads the result from the text file and returns it.
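Such a function might look like this sketch (script and file names match the example above; the regular expression is an assumption about the output format):

```r
runScript = function(x) {
  # call the external script with the two parameters
  system(sprintf("./fun.sh %f %f", x[1], x[2]))
  # read the sentence and pull out the hidden number
  res.txt = readLines("result.txt")
  res = stringi::stri_match_first_regex(res.txt, "The result is (-?[0-9.]+)!")[, 2]
  as.numeric(res)
}
```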

This function uses `stringi` and *regular expressions* to match the result within the sentence.
Depending on the output, different strategies to read the result make sense.
XML files can usually be accessed with `XML::xmlParse`, `XML::getNodeSet`, `XML::xmlAttrs` etc. using `XPath` queries.
Sometimes the good old `read.table()` is also sufficient.
If, for example, the output is written in a file like this:

You can easily use `source()` like that:
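A minimal sketch (the file name and values here are made up):

```r
# suppose the external program wrote a file containing one R expression
writeLines("list(value1 = 3.1, value2 = 0.4)", "output.txt")

# source() evaluates the expression and returns it in $value
res = source("output.txt")$value
res$value1  # 3.1
```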

which will return a list with the entries `$value1` and `$value2`.

To evaluate the function from within **mlrMBO** it has to be wrapped in a **smoof** function.
The smoof function also contains information about the bounds and scales of the domain of the objective function, defined in a *ParameterSet*.
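A sketch of the wrapping (the bounds and the function name are assumptions; `runScript` is the R wrapper around the bash script from above):

```r
library(smoof)
library(ParamHelpers)

objfun = makeSingleObjectiveFunction(
  name = "external.script",
  fn = runScript,
  par.set = makeNumericParamSet("x", len = 2, lower = -3, upper = 3),
  minimize = TRUE
)
```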

If you run this locally, you will see that the console output generated by our shell script directly appears in the R-console. This can be helpful but also annoying.

If a lot of output is generated during a single call of `system()` it might even crash R.
To avoid that I suggest redirecting the output into a file.
This way no output is lost and the R console does not get flooded.
We can achieve that simply by replacing the `command` in the function `runScript` from above with the following code:
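Inside `runScript`, the replacement could look like this (the log file name is made up):

```r
# build the call as before, but append both stdout and stderr to a log file
command = sprintf("./fun.sh %f %f", x[1], x[2])
command = paste(command, ">> optimization.log 2>&1")
system(command)
```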

Now everything is set so we can proceed with the usual MBO setup:
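A sketch of such a setup (10 iterations are an arbitrary budget; `objfun` is the smoof function wrapping the external script):

```r
library(mlrMBO)

ctrl = makeMBOControl()
ctrl = setMBOControlTermination(ctrl, iters = 10L)

res = mbo(objfun, control = ctrl)
res$x  # best parameters found
res$y  # corresponding objective value
```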

Also, you might not want to be bothered with starting *R* and running this script manually, so I would recommend saving all of the above as an R script, plus some lines that write the output to a JSON file like this:
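For example (the output file name is an assumption; `res` is the result of the `mbo()` call):

```r
# serialize the best point and its objective value for later consumption
jsonlite::write_json(list(x = res$x, y = res$y), "mbo_result.json",
  auto_unbox = TRUE)
```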

Let’s assume we saved all of that as an R script under the name `runMBO.R` (actually it is available as a gist).

Then you can simply run it from the command line:

As an extra, the script in the gist also contains a simple handler for command line arguments. With it you can define the number of optimization iterations and the maximal allowed time in seconds for the optimization. You can also set the seed to make runs reproducible:

If you want to build a more advanced command line interface you might want to have a look at docopt.

To clean up all the files generated by this script you can run:

`batchtools`.
The data that we will use here is stored on the open machine learning platform openml.org, and we can download it, together with information on what to do with it, in the form of a task.

If you have a small project and don’t need to parallelize, you might want to just look at the previous blog post called mlr loves OpenML.

The following packages are needed for this:

Now we download five OpenML-tasks from OpenML:

In the next step we need to create the so-called registry. This basically creates a folder with a certain subfolder structure.
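A sketch of this step (the package list mirrors what the experiment below needs):

```r
library(batchtools)

reg = makeExperimentRegistry(
  file.dir = "parallel_benchmarking_blogpost",
  packages = c("mlr", "OpenML", "party")
)
```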

Now you should have a new folder in your working directory with the name `parallel_benchmarking_blogpost` and the following subfolders / files:

```
parallel_benchmarking_blogpost/
├── algorithms
├── exports
├── external
├── jobs
├── logs
├── problems
├── registry.rds
├── results
└── updates
```

In the next step we get to the interesting point. We need to define…

- the **problems**, which in our case are simply the OpenML tasks we downloaded.
- the **algorithm**, which with mlr and OpenML is quite simply achieved using `makeLearner` and `runTaskMlr`. We do not have to save the run results (the result of applying the learner to the task), but can directly upload them to OpenML, where the results are automatically evaluated.
- the machine learning **experiments**, i.e. in our case which parameters we want to set for which learner. As an example here, we will look at the *ctree* algorithm from the *party* package and see whether Bonferroni correction (correction for multiple testing) helps to get better predictions, and also whether we need a tree with more than two leaf nodes (`stump = FALSE`) or if a small tree is enough (`stump = TRUE`).
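These three steps might be sketched as follows (names and parameter values are assumptions; `tasks` stands for the list of downloaded OpenML tasks, and the ctree parameters are assumed to be exposed by mlr's `classif.ctree` learner):

```r
# problems: one per OpenML task
for (task in tasks) {
  addProblem(name = paste0("omltask_", task$task.id), data = task)
}

# algorithm: train ctree via mlr and upload the run to OpenML
addAlgorithm("ctree", fun = function(job, data, instance, ...) {
  lrn = makeLearner("classif.ctree", par.vals = list(...))
  run = runTaskMlr(data, lrn)
  uploadOMLRun(run)
})

# experiments: cross Bonferroni correction with the stump setting
design = expand.grid(testtype = c("Bonferroni", "Univariate"),
  stump = c(TRUE, FALSE), stringsAsFactors = FALSE)
addExperiments(algo.designs = list(ctree = design))
```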

Now we can simply run our experiment:

While your jobs are running, you can check the progress using `getStatus()`.
As soon as `getStatus()` tells us that all our runs are done, we can collect the results of our experiment from OpenML.
To be able to do this, we need to collect the IDs of the runs we uploaded during the experiment.
We also want to add the information on the parameters used (`getJobPars()`).

With the run ID information we can now grab the evaluations from OpenML and plot for example the parameter settings against the predictive accuracy.
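With a vector `run.ids` of uploaded run IDs, the evaluations can be fetched like this (a sketch):

```r
library(OpenML)

# download the server-side evaluations for our runs
evals = listOMLRunEvaluations(run.id = run.ids)
head(evals[, c("run.id", "task.id", "predictive.accuracy")])
```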

We see that the only data set where a stump is good enough is the pc1 data set. For the madelon data set Bonferroni correction helps. For the others it does not seem to matter. You can check out the results online by going to the task websites (e.g. for task 9976 for the madelon data set go to openml.org/t/9976) or the run websites (e.g. openml.org/r/1852889).

The key features of **mlrMBO** are:

- Global optimization of expensive black-box functions.
- Multi-Criteria Optimization.
- Parallelization through multi-point proposals.
- Support for optimization over categorical variables using random forests as a surrogate.

For examples covering different scenarios we have vignettes that are also available as online documentation.
For **mlr** users **mlrMBO** is especially interesting for hyperparameter optimization.

**mlrMBO** for **mlr** hyperparameter tuning was already used in an earlier blog post.
Nonetheless we want to provide a small toy example to demonstrate the work flow of **mlrMBO** in this post.

First, we define an objective function that we are going to minimize:

To define the objective function we use `makeSingleObjectiveFunction` from the neat package **smoof**, which among other benefits lets us directly visualize the function.
*If you happen to be in need of functions to optimize and benchmark your optimization algorithm, I recommend you have a look at the package!*

Let’s start with the configuration of the optimization:
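A possible configuration (the infill criterion and the budget are choices, not requirements):

```r
library(mlrMBO)

ctrl = makeMBOControl()
ctrl = setMBOControlTermination(ctrl, iters = 10L)
# Expected Improvement as infill criterion
ctrl = setMBOControlInfill(ctrl, crit = makeMBOInfillCritEI())
```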

The optimization has to start with an initial design.
**mlrMBO** can automatically create one, but here we are going to use a randomly sampled LHS design of our own:
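For instance (8 initial points are an arbitrary choice; `obj.fun` stands for the smoof objective defined earlier):

```r
library(ParamHelpers)

set.seed(1)
# latin hypercube sample over the domain of the objective function
des = generateDesign(n = 8L, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)
```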

The points demonstrate how the initial design already covers the search space but misses the area of the global minimum.
Before we can start the Bayesian optimization we have to set the surrogate learner to *Kriging*.
For that we use an *mlr* regression learner.
In fact, with *mlrMBO* you can use any regression learner integrated in *mlr* as a surrogate, allowing for many special optimization applications.
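A sketch (the Matérn 3/2 kernel is a common, but not the only, choice):

```r
library(mlr)

# Kriging surrogate; predict.type = "se" is needed for Expected Improvement
surrogate = makeLearner("regr.km", predict.type = "se", covtype = "matern3_2")
```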

*Note:* **mlrMBO** can automatically determine a good surrogate learner based on the search space defined for the objective function.
For a purely numeric domain it would have chosen *Kriging* as well with some slight modifications to make it a bit more stable against numerical problems that can occur during optimization.

Finally, we can start the optimization run:
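Putting the pieces together (the object names follow the sketches above: objective, initial design, surrogate learner and control object):

```r
res = mbo(obj.fun, design = des, learner = surrogate, control = ctrl)
res$x  # location of the best point found
res$y  # its objective value
```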

We can see that we have found the global optimum of $y = -0.414964$ at $x = (-1.35265, 0)$ quite accurately.
Let’s have a look at the points mlrMBO evaluated.
For that we can use the `OptPath`, which stores all information about all evaluations during the optimization run:

It is interesting to see that for this run the algorithm first went to the local minimum at the top right in the 6th and 7th iterations, but later, thanks to the explorative character of the *Expected Improvement*, found the real global minimum.

That is all good, but how do other optimization strategies perform?

Grid search is seldom a good idea, but it is still used, especially for hyperparameter tuning. Probably because it gives you the feeling that you know what is going on and have not left out any important area of the search space. In reality the grid is usually so sparse that it leaves important areas untouched, as you can see in this example:

It is no surprise that the grid search could not cover the search space well enough, and we only reach a bad result.

With random search you can always get lucky, but on average the optimum is not reached in cases where smarter optimization strategies work well.

… for stochastic optimization algorithms can only be achieved by repeating the runs.
**mlrMBO** is stochastic, as the initial design is generated randomly and the fit of the Kriging surrogate is also not deterministic.
Furthermore, we should include other optimization strategies like a genetic algorithm and direct competitors like `rBayesOpt`.
An extensive benchmark is available in our **mlrMBO** paper.
The examples here are just meant to demonstrate the package.

If you want to contribute to **mlrMBO** we are always open to suggestions and pull requests on GitHub.
You are also invited to fork the repository and build and extend your own optimizer based on our toolbox.

`mlr`, can be used to tune an xgboost model with random search in parallel (using 16 cores). The R script scores rank 90 (of 3251) on the Kaggle leaderboard.
- Use good software
- Understand the objective
- Create and select features
- Tune your model
- Validate your model
- Ensemble different models
- Track your progress

Whether you choose R, Python or another language to work on Kaggle, you will most likely need to leverage quite a few packages to follow best practices in machine learning. To save time, you should use ‘software’ that offers a standardized and well-tested interface for the important steps in your workflow:

- Benchmarking different machine learning algorithms (learners)
- Optimizing hyperparameters of learners
- Feature selection, feature engineering and dealing with missing values
- Resampling methods for validation of learner performance
- Parallelizing the points above

Examples of ‘software’ that implement the steps above and more:

- For python: scikit-learn (http://scikit-learn.org/stable/auto_examples).
- For R:
`mlr`

(https://mlr-org.github.io/mlr-tutorial) or`caret`

.

To develop a good understanding of the Kaggle challenge, you should:

- Understand the problem domain:
  - Read the description and try to understand the aim of the competition.
  - Keep reading the forum and looking into scripts/kernels of others, learn from them!
  - Domain knowledge might help you (i.e., read publications about the topic, wikipedia is also ok).
  - Use external data if allowed (e.g., google trends, historical weather data).

- Explore the dataset:
  - Which features are numerical, categorical, ordinal or time dependent?
  - Decide how to handle *missing values*. Some options:
    - Impute missing values with the mean, median or with values that are out of range (for numerical features).
    - Interpolate missing values if the feature is time dependent.
    - Introduce a new category for the missing values or use the mode (for categorical features).
  - Do exploratory data analysis (for the lazy: wait until someone else uploads an EDA kernel).
  - Insights you learn here will inform the rest of your workflow (creating new features).

Make sure you choose an approach that directly optimizes the measure of interest! Example:

- The **median** minimizes the mean absolute error **(MAE)** and the **mean** minimizes the mean squared error **(MSE)**.
- By default, many regression algorithms predict the expected **mean**, but there are counterparts that predict the expected **median** (e.g., linear regression vs. quantile regression).
- For strange measures: use algorithms where you can implement your own objective function, see e.g.

In many kaggle competitions, finding a “magic feature” can dramatically increase your ranking. Sometimes, better data beats better algorithms! You should therefore try to introduce new features containing valuable information (which can’t be found by the model) or remove noisy features (which can decrease model performance):

- Concatenate several columns
- Multiply/add several numerical columns
- Count NAs per row
- Create dummy features from factor columns
- For time series, you could try
  - to add the weekday as a new feature
  - to use the rolling mean or median of any other numerical feature
  - to add features with a lag…
- Remove noisy features: *feature selection / filtering*

Typically you can focus on a single model (e.g. *xgboost*) and tune its hyperparameters for optimal performance.

- Aim: Find the best hyperparameters that, for the given data set, optimize the pre-defined performance measure.
- Problem: Some models have many hyperparameters that can be tuned.
- Possible solutions:
  - *Grid search or random search*
  - Advanced procedures such as *irace* or *mbo (Bayesian optimization)*

Good machine learning models not only work on the data they were trained on, but also on unseen (test) data that was not used for training the model. When you use training data to make any kind of decision (like feature or model selection, hyperparameter tuning, …), the data becomes less valuable for generalization to unseen data. So if you just use the public leaderboard for testing, you might overfit to the public leaderboard and lose many ranks once the private leaderboard is revealed. A better approach is to use validation to get an estimate of performance on unseen data:

- First figure out how the Kaggle data was split into train and test data. Your resampling strategy should follow the same method if possible. So if Kaggle uses, e.g., a feature for splitting the data, you should not use random samples for creating cross-validation folds.
- Set up a *resampling procedure*, e.g., cross-validation (CV), to measure your model performance.
- Improvements on your local CV score should also lead to improvements on the leaderboard.
- If this is not the case, you can try
  - several CV folds (e.g., 3-fold, 5-fold, 8-fold)
  - repeated CV (e.g., 3 times 3-fold, 3 times 5-fold)
  - stratified CV
- `mlr` offers nice *visualizations to benchmark* different algorithms.

After training many different models, you might want to ensemble them into one strong model using one of these methods:

- simple averaging or voting
- finding optimal weights for averaging or voting
- stacking

A kaggle project might get quite messy very quickly, because you might try and prototype many different ideas. To avoid getting lost, make sure to keep track of:

- What preprocessing steps were used to create the data
- What model was used for each step
- What values were predicted in the test file
- What local score did the model achieve
- What public score did the model achieve

If you do not want to use a tool like git, at least make sure you create subfolders for each prototype. This way you can later analyse which models you might want to ensemble or use for your final submissions to the competition.

Conducting research openly and reproducibly is becoming the gold standard in academic research. Practicing open and reproducible research, however, is hard. OpenML.org (Open Machine Learning) is an online platform that aims at making the parts of research involving data and analyses easier. It automatically connects data sets, research tasks, algorithms, analyses and results, and allows users to access all components, including meta information, through a REST API in a machine-readable and standardized format. Everyone can see, work with and expand other people’s work in a fully reproducible way.

At useR!2017, we will present an R package to interface the OpenML platform and illustrate its usage both as a stand-alone package and in combination with the mlr machine learning package. Furthermore, we show how the OpenML package allows R users to easily search, download and upload machine learning datasets.

In 2017, we are hosting the workshop at LMU Munich. The workshop will run from 6 March to 10 March 2017 (potentially including the Sunday before and the Saturday after), hosted by the Ludwig-Maximilians-University Munich.

*Important Dates:*

- Address: Geschwister-Scholl-Platz 1, Room: M203.
- Start: 6th of March: 10:00 AM.

**It is also possible to arrive on Saturday or Sunday, as we already have the rooms and are able to work there. But this is totally optional and the official workshop starts on Monday. Same thing for the Saturday after the workshop.**

The mlr developer team is quite international: Germany, USA, Canada. The time difference between these countries sometimes makes it hard to communicate and develop new features. The idea for this workshop or sprint was to have the possibility to talk about the project status, future and structure, exterminate imperishable bugs and start developing some fancy features.

The workshop is mainly geared towards the already existing crowd of developers, but it might also be a perfect opportunity to join the team, if you want to help. We are always looking for competent people to collaborate with. If you are interested, please register using the following form; we look forward to seeing you in Munich! Join us for the excellent team and the nice Bavarian beer and food!

We want to thank all our sponsors. Without them this workshop would not be possible.

You can find the workshop schedule here

Our last workshop in 2016 was in Palermo, Italy. Twelve people from the developer team met from the 8th to the 15th of August to work on and with mlr. It was more like a sprint, as our core developers met to get stuff done: we closed a lot of issues and developed new features that we will release with version 2.10 of mlr. Thanks to all participants: Giuseppe Casalicchio, Janek Thomas, Xudong Sun, Jakob Bossek, Bernd Bischl, Jakob Richter, Michel Lang, Philipp Probst, Julia Schiffner, Lars Kotthoff, Zachary Jones, Pascal Kerschke! We also had a great time aside from the workshop, travelling around the city for sightseeing and enjoying the beach and the nice food of Palermo.

Operator Based Machine Learning Pipeline Construction

We aim to change the way we are currently doing data preprocessing in mlr. Have a look at the proposal linked above for more details.

If you are interested in doing this project, have a look at the tests and requirements.
