The field of Machine Learning has grown tremendously in recent years and is a key component of data-driven science. Data analysis algorithms are invented and used every day, but their results and experiments are published almost exclusively in journals or separate repositories. However, data by itself has no value. It’s the ever-changing ecosystem surrounding data that gives it meaning.

OpenML is a networked science platform that aims to connect and organize all this knowledge online, linking data, algorithms, results and people into a coherent whole so that scientists and practitioners can easily build on prior work and collaborate in real time online.

OpenML has an online interface on openml.org, and is integrated in the most popular machine learning tools and statistical environments such as R, Python, WEKA, MOA and RapidMiner. This allows researchers and students to easily import and export data from these tools and share them with others online, fully integrated into the context of the state of the art. On OpenML, researchers can connect to each other, start projects, and build on the results of others. It automatically keeps track of how often shared work is reused so that researchers can follow the wider impact of their work and become more visible.

The OpenML workshop is organized as a hackathon, an event where participants from many scientific domains present their goals and ideas, and then work on them in small teams for many hours or days at a time. Participants bring their laptops, learn how to use OpenML in tutorials, and build upon that to create something great to push their research forward. The complete OpenML development team will be available to get them started, answer questions, and implement new features on the fly.

The next OpenML Workshop will run from 9 to 13 October 2017.

- Simple Random Search
- Grid Search
- Iterated F-Racing (via **irace**)
- Sequential Model-Based Optimization (via **mlrMBO**)

The search space is also easily definable and customizable for each of the 60+ learners of mlr, using the ParamSets from the **ParamHelpers** package.

The main drawback of **mlr** compared to **caret** in this regard is that **mlr** itself does not come with defaults for the search spaces.
This is where **mlrHyperopt** comes into play.

**mlrHyperopt** offers

- default search spaces for the most important learners in **mlr**,
- parameter tuning in one line of code,
- and an API to add and access custom search spaces from the mlrHyperopt Database.

Tuning can be done in one line relying on the defaults.
The default will automatically minimize the *misclassification rate*.
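A minimal sketch of such a one-line call (the task and learner here are examples, not necessarily the ones used in the original post):

```r
library(mlrHyperopt)
# tune an SVM on the iris task using the default search space
res <- hyperopt(iris.task, learner = "classif.svm")
```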

We can find out what `hyperopt` did by inspecting the `res` object.

Depending on the parameter space **mlrHyperopt** will automatically decide for a suitable tuning method:

As the search space defined in the ParamSet is only numeric, sequential Bayesian optimization was chosen. We can look into the evaluated parameter configurations and we can visualize the optimization run.

The upper left plot shows the distribution of the tried settings in the search space, and contour lines indicate where regions of good configurations are located. The lower right plot shows the value of the objective (the misclassification rate) and how it decreases over time. This also shows nicely that wrong settings can lead to bad results.

If you just want to use **mlrHyperopt** to access the default parameter search spaces, you don’t have to rely on its default tuning procedures; you can incorporate the search spaces directly into your own **mlr** workflow.
Here is one example how you can use the default search spaces for an easy benchmark:

As we can see, we were able to improve the performance of xgboost and nnet without any additional knowledge about which parameters to tune. Especially for nnet the improvement is noticeable.

Some recommended additional reading:

- Vignette on getting started and on how to contribute by uploading alternative or additional ParConfigs.
- How to work with ParamSets, as part of the Vignette.
- The slides of the useR 2017 talk on **mlrHyperopt**.

Stefan and I started working on this project in late summer 2016 as part of a practical course in our Master’s program. We enjoyed the work on this project and will continue to maintain and extend our app in the future. After almost one year of work, our application has become a versatile tool and it is time to present it to a broader audience. To introduce you to the workflow and main features of our app, we uploaded a video series to our YouTube channel. The videos are little tutorials that illustrate the workflow in the form of a use case: we use the Titanic dataset from the Kaggle competition as example data to show you, step by step, how it can be analyzed with our application.

The first video gives a small introduction and shows you how data can be imported:

In the next tutorial you will learn how to visualise your data and preprocess it:

The third and fourth screencasts show you how to create your task and how to construct and modify our built-in learning algorithms:

The fifth part of our tutorials shows you how to tune your learners to find suitable parameter settings for your given training set:

The sixth video gives you detailed information on how to actually train models on your task, predict on new data and plot model diagnostic and prediction plots:

The seventh video runs a benchmark experiment, to show you how to compare different learners in our application:

The last tutorial briefly demonstrates how to render an interactive report from your analysis done with our app:

I hope you enjoyed watching the videos and learned how to make use of our application. If you like working with our app, please leave us a star and follow us on GitHub.

First we need to install the `cranlogs` package using `devtools`:
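For instance like this (assuming the package lives in the `metacran/cranlogs` GitHub repository; it is also available on CRAN nowadays):

```r
# install cranlogs from GitHub
devtools::install_github("metacran/cranlogs")
```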

Now let’s load all the packages we will need:

To obtain a neat table of all available learners in *mlr* we can call `listLearners()`.
This table also contains a column with the packages needed for each learner, separated by commas.
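A sketch of this step:

```r
library(mlr)
# all registered learners with their class, name and required package(s)
lrns <- listLearners()
head(lrns[, c("class", "name", "package")])
```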

*Note:* You might get some warnings here because you likely did not install all packages that *mlr* suggests – which is totally fine.

Now we can obtain the download counts from the *RStudio CRAN mirror*, here for the last month.
We use `data.table` to easily sum up the download counts over the days.
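A sketch of this step, shown for a single package:

```r
library(cranlogs)
library(data.table)
# daily downloads over the last month, summed per package
dl <- as.data.table(cran_downloads(packages = "mlr", when = "last-month"))
dl[, .(downloads = sum(count)), by = package]
```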

As some learners need multiple packages we will use the download count of the package with the least downloads.

Let’s put these numbers in our table:

*Here are the first 5 rows of the table:*

class | name | package | downloads |
---|---|---|---|
surv.coxph | Cox Proportional Hazard Model | survival | 153681 |
classif.naiveBayes | Naive Bayes | e1071 | 102249 |
classif.svm | Support Vector Machines (libsvm) | e1071 | 102249 |
regr.svm | Support Vector Machines (libsvm) | e1071 | 102249 |
classif.lda | Linear Discriminant Analysis | MASS | 55852 |

Now let’s get rid of the duplicates introduced by the distinction between the types *classif*, *regr* etc., and we already have our…

The top 20 according to the *rstudio cran mirror*:

class | name | package | downloads |
---|---|---|---|
surv.coxph | Cox Proportional Hazard Model | survival | 153681 |
classif.naiveBayes | Naive Bayes | e1071 | 102249 |
classif.svm | Support Vector Machines (libsvm) | e1071 | 102249 |
classif.lda | Linear Discriminant Analysis | MASS | 55852 |
classif.qda | Quadratic Discriminant Analysis | MASS | 55852 |
classif.randomForest | Random Forest | randomForest | 52094 |
classif.gausspr | Gaussian Processes | kernlab | 44812 |
classif.ksvm | Support Vector Machines | kernlab | 44812 |
classif.lssvm | Least Squares Support Vector Machine | kernlab | 44812 |
cluster.kkmeans | Kernel K-Means | kernlab | 44812 |
regr.rvm | Relevance Vector Machine | kernlab | 44812 |
classif.cvglmnet | GLM with Lasso or Elasticnet Regularization (Cross Validated Lambda) | glmnet | 41179 |
classif.glmnet | GLM with Lasso or Elasticnet Regularization | glmnet | 41179 |
surv.cvglmnet | GLM with Regularization (Cross Validated Lambda) | glmnet | 41179 |
surv.glmnet | GLM with Regularization | glmnet | 41179 |
classif.cforest | Random Forest Based on Conditional Inference Trees | party | 36492 |
classif.ctree | Conditional Inference Trees | party | 36492 |
regr.cforest | Random Forest Based on Conditional Inference Trees | party | 36492 |
regr.mob | Model-based Recursive Partitioning Yielding a Tree with Fitted Models Associated with each Terminal Node | party,modeltools | 36492 |
surv.cforest | Random Forest Based on Conditional Inference Trees | party,survival | 36492 |

As we are just looking for the packages let’s compress the table a bit further and come to our…

*Here are the first 20 rows of the table:*

package | downloads | learners |
---|---|---|
survival | 153681 | surv.coxph |
e1071 | 102249 | classif.naiveBayes,classif.svm,regr.svm |
MASS | 55852 | classif.lda,classif.qda |
randomForest | 52094 | classif.randomForest,regr.randomForest |
kernlab | 44812 | classif.gausspr,classif.ksvm,classif.lssvm,cluster.kkmeans,regr.gausspr,regr.ksvm,regr.rvm |
glmnet | 41179 | classif.cvglmnet,classif.glmnet,regr.cvglmnet,regr.glmnet,surv.cvglmnet,surv.glmnet |
party | 36492 | classif.cforest,classif.ctree,multilabel.cforest,regr.cforest,regr.ctree |
party,modeltools | 36492 | regr.mob |
party,survival | 36492 | surv.cforest |
fpc | 33664 | cluster.dbscan |
rpart | 28609 | classif.rpart,regr.rpart,surv.rpart |
RWeka | 20583 | classif.IBk,classif.J48,classif.JRip,classif.OneR,classif.PART,cluster.Cobweb,cluster.EM,cluster.FarthestFirst,cluster.SimpleKMeans,cluster.XMeans,regr.IBk |
gbm | 19554 | classif.gbm,regr.gbm,surv.gbm |
nnet | 19538 | classif.multinom,classif.nnet,regr.nnet |
caret,pls | 18106 | classif.plsdaCaret |
pls | 18106 | regr.pcr,regr.plsr |
FNN | 16107 | classif.fnn,regr.fnn |
earth | 15824 | regr.earth |
neuralnet | 15506 | classif.neuralnet |
class | 14493 | classif.knn,classif.lvq1 |

And of course we want to have a small visualization:

This is not really representative of how popular each learner is, as some packages serve multiple purposes (e.g. contain multiple learners).
Furthermore, it would be great to have access to a trending list.
The *most starred* machine learning packages on GitHub also give a better view of what developers are interested in; there we see, e.g., xgboost, h2o and tensorflow.

First, let me introduce you to multilabel classification. This is a classification problem in which every instance can have more than one label. Let’s have a look at a typical multilabel dataset (which I, of course, download from the OpenML server):
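A sketch of the download step (the OpenML data ID used here is an assumption; look up the exact ID of *scene* on openml.org):

```r
library(OpenML)
scene <- getOMLDataSet(data.id = 312L)  # 312 assumed to be the scene dataset
scene.data <- scene$data
```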

Here I took the *scene* dataset, where the features represent color information of pictures and the targets could be objects like *beach*, *sunset*, and so on.

As you can see above, one defining property of a multilabel dataset is that the target variables (which are called *labels*) are binary. If you want to use your own dataset, make sure to encode these variables as *logical*, where *TRUE* indicates the relevance of a label.

The basic idea behind many multilabel classification algorithms is to make use of possible correlation between labels. Maybe a learner is very good at predicting label 1, but rather bad at predicting label 2. If label 1 and label 2 are highly correlated, it may be beneficial to predict label 1 first and use this prediction as a feature for predicting label 2.

This approach is the main concept behind the so called *problem transformation methods*. The multilabel problem is transformed into binary classification problems, one for each label. Predicted labels are used as features for predicting other labels.

We implemented the following problem transformation methods:

- Classifier chains
- Nested stacking
- Dependent binary relevance
- Stacking

How these methods are defined can be read in the mlr tutorial or, in more detail, in our paper. Enough theory now; let’s apply these methods to our dataset.

First we need to create a multilabel task.

We set a seed, because the classifier chain wrapper uses a random chain order. Next, we train a learner. I chose the classifier chain approach together with a decision tree for the binary classification problems.

Now let’s train and predict on our dataset:
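A self-contained sketch of these steps, using mlr’s built-in `yeast.task` multilabel example task instead of the scene data:

```r
library(mlr)
set.seed(1)  # the chain order is drawn at random
binary.lrn <- makeLearner("classif.rpart", predict.type = "prob")
cc.lrn <- makeMultilabelClassifierChainsWrapper(binary.lrn)
mod <- train(cc.lrn, yeast.task)
pred <- predict(mod, task = yeast.task)
```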

We also implemented common multilabel performance measures. Here is a list with available multilabel performance measures:
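In mlr they can be listed for a given task, e.g.:

```r
library(mlr)
# measures applicable to a multilabel task (yeast.task as example)
listMeasures(yeast.task)
```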

Here is how the classifier chains method performed:

Now let’s see if it can be beneficial to use predicted labels as features for other labels. Let us compare the performance of the classifier chains method with the binary relevance method (this method does not use predicted labels as features).
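A sketch of such a comparison (measure names as defined in mlr; again using the built-in `yeast.task` for illustration):

```r
library(mlr)
set.seed(1)
bin <- makeLearner("classif.rpart", predict.type = "prob")
meas <- list(multilabel.hamloss, multilabel.subset01, multilabel.f1, multilabel.acc)
for (lrn in list(makeMultilabelClassifierChainsWrapper(bin),
                 makeMultilabelBinaryRelevanceWrapper(bin))) {
  mod <- train(lrn, yeast.task)
  pred <- predict(mod, task = yeast.task)
  print(performance(pred, measures = meas))
}
```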

As can be seen here, it can indeed make sense to use more elaborate methods for multilabel classification, since classifier chains beat the binary relevance method in all of these measures (note that hamming loss and subset01 are loss measures!).

Here I’ll show you how to use resampling methods in the multilabel setting. Resampling methods are key for assessing the performance of a learning algorithm. To read more about resampling, see the page on our tutorial.

First, we need to define a resampling strategy. I chose subsampling, which is also called Monte-Carlo cross-validation. The dataset is split into training and test set at a predefined ratio. The learner is trained on the training set, the performance is evaluated with the test set. This whole process is repeated many times and the performance values are averaged. In mlr this is done the following way:

Now we can choose a measure to be resampled. All that is left to do is to run the resampling:
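A sketch of the whole resampling step (iteration count and split ratio are assumptions; `yeast.task` used for illustration):

```r
library(mlr)
rdesc <- makeResampleDesc("Subsample", iters = 10, split = 2/3)
lrn <- makeMultilabelBinaryRelevanceWrapper(makeLearner("classif.rpart"))
r <- resample(lrn, yeast.task, rdesc, measures = multilabel.hamloss)
```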

If you followed the mlr tutorial or are already familiar with mlr, you most likely noticed that using resampling in the multilabel setting isn’t any different from using resampling in mlr in general. Many methods available in mlr, like preprocessing, tuning or benchmark experiments, can also be used for multilabel datasets, and the good thing is: the syntax stays the same!

You can vote for your favorite logo on GitHub by reacting to the logo with a +1.

Thanks to Hannah Atkin for designing the logos!

We recently published **mlrMBO** on CRAN.
Like any R package it usually operates inside of R, but in this post I want to demonstrate how **mlrMBO** can be used to optimize an external application.
Along the way I will highlight some issues you are likely to run into.

First of all we need a bash script that we want to optimize.
This tutorial will only run on Unix systems (Linux, macOS etc.) but should also be informative for Windows users.
The following code writes a tiny bash script that uses `bc` to calculate $\sin(x_1-1) + x_1^2 + x_2^2$ and writes the result “hidden” in a sentence (`The result is 12.34!`) to a `result.txt` text file.
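A sketch of such a generator (script name, argument order and output file name are assumptions):

```r
lines <- c(
  "#!/bin/bash",
  "x1=$1; x2=$2",
  "# bc -l provides s() for sine",
  'res=$(echo "s($x1 - 1) + $x1^2 + $x2^2" | bc -l)',
  'echo "The result is $res!" > result.txt'
)
writeLines(lines, "fun.sh")
Sys.chmod("fun.sh", mode = "0755")
```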

Now we need an R function that starts the script, reads the result from the text file and returns it.
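One possible implementation (file names as assumed above):

```r
runScript <- function(x) {
  system(paste("./fun.sh", x[1], x[2]))
  line <- readLines("result.txt")
  # pull the number out of "The result is 12.34!"
  as.numeric(stringi::stri_match_first_regex(line, "-?[0-9.]+")[, 1])
}
```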

This function uses `stringi` and *regular expressions* to match the result within the sentence.
Depending on the output, different strategies to read the result make sense.
XML files can usually be accessed with `XML::xmlParse`, `XML::getNodeSet`, `XML::xmlAttrs` etc. using `XPath` queries.
Sometimes the good old `read.table()` is also sufficient.
If, for example, the output is written in a file like this:

You can easily use `source()` like that:

which will return a list with the entries `$value1` and `$value2`.
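For instance, if the external tool writes a file `output.txt` whose last expression is something like `list(value1 = 1.2, value2 = 3.4)` (names assumed), you could read it with:

```r
# source() returns the value of the last evaluated expression in $value
res <- source("output.txt")$value
res$value1
res$value2
```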

To evaluate the function from within **mlrMBO** it has to be wrapped in a **smoof** function.
The smoof function also contains information about the bounds and scales of the domain of the objective function, defined in a *ParameterSet*.
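A sketch of the wrapping (the bounds are assumptions; `runScript` as defined above):

```r
library(smoof)
library(ParamHelpers)
obj.fun <- makeSingleObjectiveFunction(
  name = "external.bash.script",
  fn = runScript,
  par.set = makeNumericParamSet("x", len = 2, lower = -3, upper = 3),
  minimize = TRUE
)
```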

If you run this locally, you will see that the console output generated by our shell script directly appears in the R-console. This can be helpful but also annoying.

If a lot of output is generated during a single call of `system()`, it might even crash R.
To avoid that, I suggest redirecting the output into a file.
This way no output is lost and the R console does not get flooded.
We can achieve that simply by replacing the `command` in the function `runScript` above with the following code:
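i.e. something along these lines inside `runScript` (the log file name is an assumption):

```r
# redirect stdout and stderr of the external script into a log file
command <- paste("./fun.sh", x[1], x[2], ">> log.txt 2>&1")
system(command)
```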

Now everything is set so we can proceed with the usual MBO setup:
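A minimal setup could look like this (the iteration budget is an assumption; `obj.fun` as defined above):

```r
library(mlrMBO)
ctrl <- makeMBOControl()
ctrl <- setMBOControlTermination(ctrl, iters = 10)
res <- mbo(obj.fun, control = ctrl)
res$x  # best parameters found
res$y  # corresponding objective value
```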

Also, you might not want to be bothered with starting *R* and running this script manually, so I recommend saving all of the above as an R script, plus some lines that write the output to a JSON file like this:
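For example (output file name assumed):

```r
library(jsonlite)
# persist the best point and its objective value
writeLines(toJSON(list(x = res$x, y = res$y)), "mboResult.json")
```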

Let’s assume we saved all of that as an R script under the name `runMBO.R` (it is actually available as a gist).

Then you can simply run it from the command line:

As an extra the script in the gist also contains a simple handler for command line arguments. In this case you can define the number of optimization iterations and the maximal allowed time in seconds for the optimization. You can also define the seed to make runs reproducible:
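i.e. something like this (the exact argument names of the gist’s handler are assumptions):

```shell
Rscript runMBO.R
# reproducible run with a custom budget:
Rscript runMBO.R iters=20 time=3600 seed=1
```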

If you want to build a more advanced command line interface you might want to have a look at docopt.

To clean up all the files generated by this script you can run:
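e.g. (file names as used in this post):

```r
file.remove(c("fun.sh", "result.txt", "log.txt", "mboResult.json"))
```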

This post shows how to parallelize benchmark experiments using `batchtools`.
The data that we will use here is stored on the open machine learning platform openml.org, and we can download it, together with information on what to do with it, in the form of a task.

If you have a small project and don’t need to parallelize, you might want to just look at the previous blog post called mlr loves OpenML.

The following packages are needed for this:
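Presumably something like (the exact set of packages is an assumption):

```r
library(OpenML)
library(mlr)
library(batchtools)
library(ggplot2)  # for plotting the results later
```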

Now we download five tasks from OpenML:
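A sketch of the download (only task 9976, the madelon dataset mentioned below, is a known ID; the other four IDs would be added here):

```r
task.ids <- c(9976L)  # plus the four remaining task IDs
tasks <- lapply(task.ids, getOMLTask)
```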

In the next step we need to create the so-called registry. What this basically does is create a folder with a certain subfolder structure.
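With `batchtools` this could look like the following sketch (the folder name is taken from below; the package list is an assumption):

```r
reg <- makeExperimentRegistry(
  file.dir = "parallel_benchmarking_blogpost",
  packages = c("mlr", "OpenML", "party")
)
```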

Now you should have a new folder in your working directory with the name `parallel_benchmarking_blogpost` and the following subfolders / files:

```
parallel_benchmarking_blogpost/
├── algorithms
├── exports
├── external
├── jobs
├── logs
├── problems
├── registry.rds
├── results
└── updates
```

In the next step we get to the interesting part. We need to define…

- the **problems**, which in our case are simply the OpenML tasks we downloaded.
- the **algorithm**, which with mlr and OpenML is quite simply achieved using `makeLearner` and `runTaskMlr`. We do not have to save the run results (the result of applying the learner to the task), but can directly upload them to OpenML, where they are automatically evaluated.
- the machine learning **experiment**, i.e. in our case which parameters we want to set for which learner. As an example, we will look at the *ctree* algorithm from the *party* package and check whether Bonferroni correction (correction for multiple testing) helps to get better predictions, and whether we need a tree with more than two leaf nodes (`stump = FALSE`) or whether a small tree is enough (`stump = TRUE`).

Now we can simply run our experiment:
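With `batchtools` this boils down to (assuming problems, algorithms and the experiment design have been added to the registry `reg`):

```r
submitJobs(reg = reg)
waitForJobs(reg = reg)
```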

While your job is running, you can check the progress using `getStatus()`.
As soon as `getStatus()` tells us that all our runs are done, we can collect the results of our experiment from OpenML.
To be able to do this, we need to collect the run IDs of the runs we uploaded during the experiment.
We also want to add the info on the parameters used (`getJobPars()`).

With the run ID information we can now grab the evaluations from OpenML and plot for example the parameter settings against the predictive accuracy.

We see that the only data set where a stump is good enough is the pc1 data set. For the madelon data set Bonferroni correction helps. For the others it does not seem to matter. You can check out the results online by going to the task websites (e.g. for task 9976 for the madelon data set go to openml.org/t/9976) or the run websites (e.g. openml.org/r/1852889).

The key features of **mlrMBO** are:

- Global optimization of expensive Black-Box functions.
- Multi-Criteria Optimization.
- Parallelization through multi-point proposals.
- Support for optimization over categorical variables using random forests as a surrogate.

For examples covering different scenarios we have vignettes that are also available as online documentation.
For **mlr** users **mlrMBO** is especially interesting for hyperparameter optimization.

**mlrMBO** for **mlr** hyperparameter tuning was already used in an earlier blog post.
Nonetheless we want to provide a small toy example to demonstrate the work flow of **mlrMBO** in this post.

First, we define an objective function that we are going to minimize:

To define the objective function we use `makeSingleObjectiveFunction` from the neat package **smoof**, which, among other benefits, lets us directly visualize the function.
*If you happen to be in need of functions to optimize and benchmark your optimization algorithm, I recommend you have a look at the package!*
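A stand-in definition (this toy function and its bounds are assumptions; the exact function of the post is not reproduced here):

```r
library(smoof)
library(ParamHelpers)
obj.fun <- makeSingleObjectiveFunction(
  name = "toy.2d",
  fn = function(x) sin(x[1] - 1) + x[1]^2 + x[2]^2,
  par.set = makeNumericParamSet("x", len = 2, lower = -2, upper = 2)
)
# plot(obj.fun) can directly visualize 1d and 2d smoof functions
```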

Let’s start with the configuration of the optimization:

The optimization has to start with an initial design.
**mlrMBO** can automatically create one, but here we are going to use a randomly sampled LHS design of our own:
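e.g. via **ParamHelpers** (the design size is an assumption; `obj.fun` as defined above):

```r
library(ParamHelpers)
set.seed(1)
des <- generateDesign(n = 8, par.set = getParamSet(obj.fun), fun = lhs::randomLHS)
```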

The points demonstrate how the initial design already covers the search space but is missing the area of the global minimum.
Before we can start the Bayesian optimization we have to set the surrogate learner to *Kriging*.
Therefore we use an *mlr* regression learner.
In fact, with *mlrMBO* you can use any regression learner integrated in *mlr* as a surrogate allowing for many special optimization applications.
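For a purely numeric search space this is typically Kriging via **DiceKriging** (sketch):

```r
library(mlr)
# standard error prediction is needed for the infill criterion
surrogate <- makeLearner("regr.km", predict.type = "se")
```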

*Note:* **mlrMBO** can automatically determine a good surrogate learner based on the search space defined for the objective function.
For a purely numeric domain it would have chosen *Kriging* as well with some slight modifications to make it a bit more stable against numerical problems that can occur during optimization.

Finally, we can start the optimization run:
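A sketch of this call (the termination budget is an assumption; `obj.fun`, `des` and `surrogate` as above):

```r
library(mlrMBO)
ctrl <- makeMBOControl()
ctrl <- setMBOControlTermination(ctrl, iters = 10)
ctrl <- setMBOControlInfill(ctrl, crit = makeMBOInfillCritEI())  # Expected Improvement
res <- mbo(obj.fun, design = des, learner = surrogate, control = ctrl)
```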

We can see that we found the global optimum of $y = -0.414964$ at $x = (-1.35265, 0)$ quite accurately.
Let’s have a look at the points mlrMBO evaluated.
For that we can use the `OptPath`, which stores all information about the evaluations during the optimization run:
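The `OptPath` can be converted to a data frame for inspection (column names depend on the parameter ids):

```r
opdf <- as.data.frame(res$opt.path)
head(opdf)  # parameter values, objective y, iteration of birth (dob), ...
```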

It is interesting to see that for this run the algorithm first went to the local minimum at the top right in the 6th and 7th iterations, but later, thanks to the explorative character of the *Expected Improvement*, found the real global minimum.

That is all good, but how do other optimization strategies perform?

Grid search is seldom a good idea, but it is still used, especially for hyperparameter tuning. Probably because it kind of gives you the feeling that you know what is going on and have not left out any important area of the search space. In reality the grid is usually so sparse that it leaves important areas untouched, as you can see in this example:

It is no surprise that grid search could not cover the search space well enough, and we only reach a bad result.

With random search you can always be lucky, but on average the optimum is not reached when smarter optimization strategies work well.

… for stochastic optimization algorithms can only be achieved by repeating the runs.
**mlrMBO** is stochastic, as the initial design is generated randomly and the fit of the Kriging surrogate is also not deterministic.
Furthermore, we should include other optimization strategies like a genetic algorithm and direct competitors like `rBayesOpt`.
An extensive benchmark is available in our **mlrMBO** paper.
The examples here are just meant to demonstrate the package.

If you want to contribute to **mlrMBO**, we are always open to suggestions and pull requests on GitHub.
You are also invited to fork the repository and build and extend your own optimizer based on our toolbox.

`mlr` can be used to tune an xgboost model with random search in parallel (using 16 cores); the accompanying R script scores rank 90 (of 3251) on the Kaggle leaderboard.
- Use good software
- Understand the objective
- Create and select features
- Tune your model
- Validate your model
- Ensemble different models
- Track your progress

Whether you choose R, Python or another language to work on Kaggle, you will most likely need to leverage quite a few packages to follow best practices in machine learning. To save time, you should use ‘software’ that offers a standardized and well-tested interface for the important steps in your workflow:

- Benchmarking different machine learning algorithms (learners)
- Optimizing hyperparameters of learners
- Feature selection, feature engineering and dealing with missing values
- Resampling methods for validation of learner performance
- Parallelizing the points above

Examples of ‘software’ that implement the steps above and more:

- For python: scikit-learn (http://scikit-learn.org/stable/auto_examples).
- For R: `mlr` (https://mlr-org.github.io/mlr-tutorial) or `caret`.

To develop a good understanding of the Kaggle challenge, you should:

- Understand the problem domain:
- Read the description and try to understand the aim of the competition.
- Keep reading the forum and looking into scripts/kernels of others, learn from them!
- Domain knowledge might help you (i.e., read publications about the topic, wikipedia is also ok).
- Use external data if allowed (e.g., google trends, historical weather data).

- Explore the dataset:
- Which features are numerical, categorical, ordinal or time dependent?
- Decide how to handle *missing values*. Some options:
  - Impute missing values with the mean, median or with values that are out of range (for numerical features).
  - Interpolate missing values if the feature is time dependent.
  - Introduce a new category for the missing values or use the mode (for categorical features).

- Do exploratory data analysis (for the lazy: wait until someone else uploads an EDA kernel).
- Insights you learn here will inform the rest of your workflow (creating new features).

Make sure you choose an approach that directly optimizes the measure of interest! Example:

- The **median** minimizes the mean absolute error **(MAE)** and the **mean** minimizes the mean squared error **(MSE)**.
- By default, many regression algorithms predict the expected **mean**, but there are counterparts that predict the expected **median** (e.g., linear regression vs. quantile regression).
- For strange measures: use algorithms where you can implement your own objective function, see e.g.
In many kaggle competitions, finding a “magic feature” can dramatically increase your ranking. Sometimes, better data beats better algorithms! You should therefore try to introduce new features containing valuable information (which can’t be found by the model) or remove noisy features (which can decrease model performance):

- Concat several columns
- Multiply/Add several numerical columns
- Count NAs per row
- Create dummy features from factor columns
- For time series, you could try
  - to add the weekday as a new feature
  - to use the rolling mean or median of any other numerical feature
  - to add features with a lag…
- Remove noisy features: *feature selection / filtering*

Typically you can focus on a single model (e.g. *xgboost*) and tune its hyperparameters for optimal performance.

- Aim: Find the best hyperparameters that, for the given data set, optimize the pre-defined performance measure.
- Problem: Some models have many hyperparameters that can be tuned.
- Possible solutions:
  - *Grid search or random search*
  - Advanced procedures such as *irace* or *mbo (Bayesian optimization)*
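In `mlr`, random search for xgboost could be sketched like this (parameter ranges, budget and the example task are assumptions):

```r
library(mlr)
lrn <- makeLearner("classif.xgboost")
ps <- makeParamSet(
  makeNumericParam("eta", lower = 0.01, upper = 0.3),
  makeIntegerParam("max_depth", lower = 2L, upper = 10L),
  makeIntegerParam("nrounds", lower = 50L, upper = 500L)
)
ctrl <- makeTuneControlRandom(maxit = 20L)
res <- tuneParams(lrn, sonar.task, resampling = cv3, par.set = ps, control = ctrl)
```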

Good machine learning models not only work on the data they were trained on, but also on unseen (test) data that was not used for training the model. When you use training data to make any kind of decision (like feature or model selection, hyperparameter tuning, …), the data becomes less valuable for generalization to unseen data. So if you just use the public leaderboard for testing, you might overfit to the public leaderboard and lose many ranks once the private leaderboard is revealed. A better approach is to use validation to get an estimate of performance on unseen data:

- First figure out how the Kaggle data was split into train and test data. Your resampling strategy should follow the same method if possible. So if Kaggle uses, e.g., a feature for splitting the data, you should not use random samples for creating cross-validation folds.
- Set up a *resampling procedure*, e.g., cross-validation (CV), to measure your model performance.
- Improvements on your local CV score should also lead to improvements on the leaderboard.
- If this is not the case, you can try
  - several CV folds (e.g., 3-fold, 5-fold, 8-fold)
  - repeated CV (e.g., 3 times 3-fold, 3 times 5-fold)
  - stratified CV
- `mlr` offers nice *visualizations to benchmark* different algorithms.

After training many different models, you might want to ensemble them into one strong model using one of these methods:

- simple averaging or voting
- finding optimal weights for averaging or voting
- stacking

A kaggle project might get quite messy very quickly, because you might try and prototype many different ideas. To avoid getting lost, make sure to keep track of:

- What preprocessing steps were used to create the data
- What model was used for each step
- What values were predicted in the test file
- What local score did the model achieve
- What public score did the model achieve

If you do not want to use a tool like git, at least make sure you create subfolders for each prototype. This way you can later analyse which models you might want to ensemble or use for your final commits for the competition.
