This July we had the great honor to present mlr and its ecosystem at the Why R 2018 Conference in Wroclaw, Poland. You can find the slides here. We want to thank the organizers for inviting us and for providing great food and coffee, and all participants for showing great interest in mlr.
Machine learning models repeatedly outperform interpretable, parametric models like the linear regression model. The gains in performance have a price: The models operate as black boxes which are not interpretable.
Fortunately, there are many methods that can make machine learning models interpretable.
The R package iml provides tools for analysing any black box machine learning model:
The mlr (Machine Learning in R) package provides a generic, object-oriented and extensible framework for classification, regression, survival analysis and clustering for the statistical programming language R.
The package targets practitioners who want to quickly apply machine learning algorithms, as well as researchers who want to implement, benchmark, and compare their new methods in a structured environment.
We are happy to announce that we now offer training courses specialized on
- The Munich R Courses already offer the three-day course “Machine Learning and Data Mining in R”.
- visualize the state of the surrogate model,
- obtain the suggested parameter configuration for the next iteration and
- update the surrogate model with arbitrary evaluations.
In the following we will demonstrate this feature on a simple example.
On the weekend of November 17-19, five brave data-knights from team “Rtus and the knights of the data.table” took on the challenge of competing in a datathon organized by Munich Re in its Munich-based innovation lab. Team Rtus was formed in April this year by a group of statistics students from LMU to prove their data skills in competitions against teams from various backgrounds. The datathon was centered around the topic “the effects of climate change and hurricane events on modern reinsurance business”. After two days of very intensive battles with databases, web crawlers and advanced machine learning modelling (using the famous R library mlr), they prevailed against strong professional competitors and won the best overall prize. After victories at the Datafest at Mannheim University and the Telefonica data challenge earlier this year, this completed a hat-trick for the core members of team Rtus.
What is OpenML?
The field of Machine Learning has grown tremendously over the last years, and is a key component of data-driven science. Data analysis algorithms are being invented and used every day, but their results and experiments are published almost exclusively in journals or separate repositories. However, data by itself has no value. It’s the ever-changing ecosystem surrounding data that gives it meaning.
OpenML is a networked science platform that aims to connect and organize all this knowledge online, linking data, algorithms, results and people into a coherent whole so that scientists and practitioners can easily build on prior work and collaborate in real time online.
Hyperparameter tuning with mlr is rich in options, as there are multiple tuning methods:
- Simple Random Search
- Grid Search
- Iterated F-Racing (via irace)
- Sequential Model-Based Optimization (via mlrMBO)
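To make the difference between the first two methods in the list concrete, here is a minimal, language-neutral sketch in Python. It is not mlr's API: in mlr these strategies are selected through control objects such as `makeTuneControlRandom()` and `makeTuneControlGrid()` that are passed to `tuneParams()`. The function `validation_error` and the parameter names `cost` and `gamma` are hypothetical stand-ins for a resampled performance estimate of a real learner.

```python
import itertools
import random


def validation_error(params):
    # Hypothetical stand-in for a cross-validated error estimate.
    # A real tuner would train and evaluate a learner here.
    return (params["cost"] - 1.0) ** 2 + (params["gamma"] - 0.1) ** 2


def grid_search(grid):
    # Evaluate every combination of the listed values.
    names = sorted(grid)
    configs = [
        dict(zip(names, values))
        for values in itertools.product(*(grid[name] for name in names))
    ]
    return min(configs, key=validation_error)


def random_search(bounds, budget, seed=0):
    # Evaluate `budget` configurations drawn uniformly from the given ranges.
    rng = random.Random(seed)
    configs = [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        for _ in range(budget)
    ]
    return min(configs, key=validation_error)


best_grid = grid_search({"cost": [0.5, 1.0, 2.0], "gamma": [0.01, 0.1, 1.0]})
best_random = random_search({"cost": (0.1, 4.0), "gamma": (0.001, 1.0)}, budget=50)
print(best_grid, best_random)
```

Grid search can only return values that were listed in advance, while random search samples the whole range under a fixed evaluation budget; which one works better depends on the problem.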
shinyMlr is a web application, built with the R package “shiny”, that provides a user interface for mlr. By wrapping the main functionalities of mlr into our app, as well as implementing additional features for data visualisation and data preprocessing, we built a widely usable application for your day-to-day machine learning tasks, which we would like to present to you today.
For the development of mlr, as well as for a “machine learning expert”, it can be handy to know which learners are used most. The point is not necessarily to see which methods perform best, but to see what is used “out there” in the real world. Thanks to the nice little package cranlogs from metacran you can at least get a rough estimate, as I will show in the following…
Multilabel classification has lately gained growing interest in the research community. We implemented several methods, which make use of the standardized mlr framework. Every available binary learner can be used for multilabel problem transformation methods. So if you’re interested in using several multilabel algorithms and want to know how to use them in the mlr framework, then this post is for you!
We at mlr are currently deciding on a new logo, and in the spirit of open-source, we would like to involve the community in the voting process!
You can vote for your favorite logo on GitHub by reacting to the logo with a +1.
Thanks to Hannah Atkin for designing the logos!
Many people who want to apply Bayesian optimization want to use it to optimize an algorithm that is not implemented in R but runs on the command line as a shell script or an executable.
We recently published mlrMBO on CRAN. Like any package it normally operates inside of R, but with this post I want to demonstrate how mlrMBO can be used to optimize an external application. At the same time I will highlight some issues you are likely to run into.
With this post I want to show you how to benchmark several learners (or learners with different parameter settings) using several data sets in a structured and parallelized fashion.
For this we want to use
The data that we will use here is stored on the open machine learning platform openml.org, and we can download it, together with information on what to do with it, in the form of a task.
We are happy to finally announce the first release of mlrMBO on CRAN after quite a long development time. For the theoretical background and a nearly complete overview of mlrMBO's capabilities you can check our paper on mlrMBO that we presubmitted to arXiv.
The key features of mlrMBO are:
- Global optimization of expensive Black-Box functions.
- Multi-Criteria Optimization.
- Parallelization through multi-point proposals.
- Support for optimization over categorical variables using random forests as a surrogate.
For examples covering different scenarios we have vignettes, which are also available as online documentation. For mlr users, mlrMBO is especially interesting for hyperparameter optimization.
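The general idea behind model-based optimization of an expensive black-box function can be sketched in a few lines. The following Python example is only a toy illustration under loudly stated assumptions and does not reproduce mlrMBO's API or its models: the surrogate is a simple inverse-distance-weighted average (mlrMBO uses proper regression models such as random forests), the infill criterion is a crude lower-confidence-bound style rule, and `objective` is a hypothetical cheap function standing in for an expensive one.

```python
import random


def objective(x):
    # Hypothetical stand-in for an expensive black-box function.
    return (x - 2.0) ** 2


def surrogate_predict(x, design):
    # Toy surrogate: inverse-distance-weighted mean of the observed values.
    num, den = 0.0, 0.0
    for xi, yi in design:
        d = abs(x - xi)
        if d < 1e-12:
            return yi
        w = 1.0 / d ** 2
        num += w * yi
        den += w
    return num / den


def infill(x, design, kappa=1.0):
    # Prefer points with a low predicted value that are also far from
    # already evaluated points (a crude exploration bonus).
    nearest = min(abs(x - xi) for xi, _ in design)
    return surrogate_predict(x, design) - kappa * nearest


def smbo(lower, upper, n_init=4, n_iter=20, n_candidates=200, seed=1):
    rng = random.Random(seed)
    # Initial design: a few random evaluations of the black box.
    design = [
        (x, objective(x))
        for x in (rng.uniform(lower, upper) for _ in range(n_init))
    ]
    for _ in range(n_iter):
        # Propose the candidate that minimizes the infill criterion,
        # evaluate it, and add it to the design for the next iteration.
        candidates = [rng.uniform(lower, upper) for _ in range(n_candidates)]
        x_next = min(candidates, key=lambda x: infill(x, design))
        design.append((x_next, objective(x_next)))
    return min(design, key=lambda point: point[1])


best_x, best_y = smbo(-5.0, 5.0)
print(best_x, best_y)
```

The loop structure (initial design, fit surrogate, propose a point, evaluate, update) is the part that carries over; the quality of the result depends heavily on the surrogate and the infill criterion, which is where a dedicated package does the real work.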
Achieving a good score on a Kaggle competition is typically quite difficult.
This blog post outlines 7 tips for beginners to improve their ranking on the Kaggle leaderboards.
For this purpose, I also created a Kernel for the Kaggle bike sharing competition that shows how the R package mlr can be used to tune an xgboost model with random search in parallel (using 16 cores). The R script scores rank 90 (of 3251) on the Kaggle leaderboard.
- Use good software
- Understand the objective
- Create and select features
- Tune your model
- Validate your model
- Ensemble different models
- Track your progress
What is OpenML?
Conducting research openly and reproducibly is becoming the gold standard in academic research. Practicing open and reproducible research, however, is hard. OpenML.org (Open Machine Learning) is an online platform that aims at making the part of research involving data and analyses easier. It automatically connects data sets, research tasks, algorithms, analyses and results and allows users to access all components including meta information through a REST API in a machine readable and standardized format. Everyone can see, work with and expand other people’s work in a fully reproducible way.
The useR Tutorial
At useR!2017, we will present an R package to interface the OpenML platform and illustrate its usage both as a stand-alone package and in combination with the mlr machine learning package. Furthermore, we show how the OpenML package allows R users to easily search, download and upload machine learning datasets.
When and Where?
In 2017, we are hosting the workshop at LMU Munich (Ludwig-Maximilians-University Munich). The workshop will run from 6 March to 10 March 2017, potentially including the Sunday before and the Saturday after.
- Address: Geschwister-Scholl-Platz 1, Room: M203.
- Start: 6th of March: 10:00 AM.
It is also possible to arrive on Saturday or Sunday, as we already have the rooms and are able to work there. But this is totally optional and the official workshop starts on Monday. Same thing for the Saturday after the workshop.
We are happy to announce that we applied for another Google Summer of Code project in 2017.
mlr 2.10 is now on CRAN. Please update your package if you haven’t done so in a while.
Here is an overview of the changes:
We are happy to announce that we can finally answer the question on how to cite mlr properly in publications.
Our paper on mlr has been published in the open-access Journal of Machine Learning Research (JMLR) and can be downloaded on the journal home page.
OpenML stands for Open Machine Learning and is an online platform that aims to support collaborative machine learning. It is an Open Science project that allows its users to share data, code and machine learning experiments.
At the time of writing this blog I am in Eindhoven at an OpenML workshop, where developers and scientists meet to work on improving the project. Some of these people are R users and they (we) are developing an R package that communicates with the OpenML platform.
I recently participated in the #TEFDataChallenge, a datathon organized by Wayra. The first prize was a drone for every team member, which is a pretty awesome prize.
So what exactly is a datathon?
Learners use hyperparameters to achieve better performance on particular datasets. When we use a machine learning package to choose the best hyperparameters, the relationship between changing a hyperparameter and the resulting performance might not be obvious. mlr provides several new implementations to better understand what happens when we tune hyperparameters and to help us optimize our choice of hyperparameters.
The mlr developer team is quite international: Germany, USA, Canada. The time difference between these countries sometimes makes it hard to communicate and develop new features.
The idea for this workshop or sprint was to have the possibility to talk about the project status, future and structure, exterminate imperishable bugs and start developing some fancy features.
Learners use features to make predictions but how those features are used is often not apparent.
mlr can estimate the dependence of a learned function on a subset of the feature space using
There are already some benchmarking studies about different classification algorithms out there. Probably the best known and most extensive one is the paper Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? The authors use different software and also different tuning processes to compare 179 learners on more than 121 datasets, mainly from the UCI site. They exclude some datasets because their dimensions (number of observations or number of features) are too high, because they are not in a proper format, or for other reasons. Some criticism about the representativeness of the datasets and the generalizability of the benchmarking results is also summarized. It remains a bit unclear whether their tuning process is also done on the test data or only on the training data (page 3154). They report the random forest algorithms to be the best (in general) for multiclass classification datasets and the support vector machine (svm) to be the second best. On binary classification tasks neural networks also perform competitively. They recommend the R library caret for choosing a classifier.
In this post I want to briefly introduce you to the great visualization possibilities of
Within the last months a lot of work has been put into that field.
This post is not a tutorial but more a demonstration of how little code you have to write with mlr to get some nice plots showing the prediction behaviors for different learners.