- visualize the state of the surrogate model,
- obtain the suggested parameter configuration for the next iteration and
- update the surrogate model with arbitrary evaluations.
In the following we will demonstrate this feature on a simple example.
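As a teaser, here is a minimal sketch of this human-in-the-loop interface; the toy objective function, parameter bounds and design size are made up purely for illustration:

```r
library(mlrMBO)

# Toy stand-in for an expensive black-box evaluation
fun = function(x) sin(x) + 0.1 * x

# One-dimensional numeric search space
ps = makeParamSet(makeNumericParam("x", lower = 0, upper = 10))

# Initial design, evaluated by us (mlrMBO never calls fun itself)
design = generateDesign(n = 5, par.set = ps)
design$y = sapply(design$x, fun)

ctrl = makeMBOControl()
opt.state = initSMBO(par.set = ps, design = design, control = ctrl, minimize = TRUE)

plot(opt.state)                    # visualize the state of the surrogate model
prop = proposePoints(opt.state)    # suggested configuration for the next iteration
y = fun(prop$prop.points$x)        # evaluate however and wherever you like
updateSMBO(opt.state, x = prop$prop.points, y = y)  # feed the result back
```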
On the weekend of November 17-19, five brave data knights from team “Rtus and the knights of the data.table” took on the challenge of competing in a datathon organized by Munich Re in its Munich-based innovation lab. Team Rtus was formed in April this year by a group of statistics students from LMU with the purpose of proving their data skills in competitions with teams from various backgrounds. The datathon was centered around the topic “the effects of climate change and hurricane events on modern reinsurance business”, and after two days of very intensive battles with databases, web crawlers and advanced machine learning modelling (using the famous R library mlr), they managed to prevail against strong professional competitors and won the best overall prize. After victories at the Datafest at Mannheim University and the Telefonica data challenge earlier this year, this completed a hat-trick for the core members of team Rtus.
The field of Machine Learning has grown tremendously in recent years and is a key component of data-driven science. Data analysis algorithms are being invented and used every day, but their results and experiments are published almost exclusively in journals or separate repositories. However, data by itself has no value. It is the ever-changing ecosystem surrounding data that gives it meaning.
OpenML is a networked science platform that aims to connect and organize all this knowledge online, linking data, algorithms, results and people into a coherent whole so that scientists and practitioners can easily build on prior work and collaborate online in real time.
Hyperparameter tuning with mlr is rich in options, as there are multiple tuning methods:
shinyMlr is a web application, built with the R package shiny, that provides a graphical user interface for mlr. By wrapping the main functionalities of mlr into our app, as well as implementing additional features for data visualisation and preprocessing, we built a widely usable application for your day-to-day machine learning tasks, which we would like to present to you today.
For the development of mlr, as well as for a machine learning expert, it can be handy to know which learners are the most popular ones — not necessarily to see which methods perform best, but to see what is used “out there” in the real world. Thanks to the nice little package cranlogs from metacran you can at least get a rough estimate, as I will show in the following.
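The idea, as a rough sketch (the package selection below is just an illustrative assumption):

```r
library(cranlogs)

# Daily CRAN download counts over the last month for a few learner packages
dl = cran_downloads(
  packages = c("randomForest", "xgboost", "e1071", "kknn"),
  when = "last-month"
)

# Total downloads per package
aggregate(count ~ package, data = dl, FUN = sum)
```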
Multilabel classification has lately gained growing interest in the research community. We implemented several methods that make use of the standardized mlr framework, and every available binary learner can be used for the multilabel problem transformation methods. So if you are interested in using several multilabel algorithms and want to know how to use them in the mlr framework, then this post is for you!
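As a first taste, a hedged sketch of the binary relevance transformation, using the yeast.task that ships with mlr:

```r
library(mlr)

# Any binary classification learner can serve as the base learner
lrn = makeLearner("classif.rpart", predict.type = "prob")

# Binary relevance: fit one binary model per label
br.lrn = makeMultilabelBinaryRelevanceWrapper(lrn)

mod = train(br.lrn, yeast.task)
pred = predict(mod, task = yeast.task)
performance(pred, measures = list(multilabel.hamloss))
```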
We at mlr are currently deciding on a new logo, and in the spirit of open-source, we would like to involve the community in the voting process!
You can vote for your favorite logo on GitHub by reacting to the logo with a +1.
Thanks to Hannah Atkin for designing the logos!
Many people want to apply Bayesian optimization to an algorithm that is not implemented in R but runs on the command line as a shell script or an executable.
We recently published mlrMBO on CRAN. As a normal package it operates inside R, but in this post I want to demonstrate how mlrMBO can be used to optimize an external application. At the same time I will highlight some issues you are likely to run into.
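To sketch the idea: wrap the external call in a smoof objective function and hand it to mbo(). The executable name, its command-line interface and the parameter bounds below are purely hypothetical:

```r
library(mlrMBO)  # also attaches smoof and ParamHelpers

# Wrap a hypothetical command-line program "./simulate" that prints
# a single numeric result to stdout
obj.fun = makeSingleObjectiveFunction(
  name = "external.app",
  fn = function(x) {
    out = system2("./simulate", args = sprintf("--x=%f", x$x), stdout = TRUE)
    as.numeric(out)
  },
  par.set = makeParamSet(makeNumericParam("x", lower = 0, upper = 1)),
  has.simple.signature = FALSE  # fn receives a named list, not a numeric vector
)

ctrl = makeMBOControl()
ctrl = setMBOControlTermination(ctrl, iters = 10)
res = mbo(obj.fun, control = ctrl)
```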
With this post I want to show you how to benchmark several learners (or learners with different parameter settings) using several data sets in a structured and parallelized fashion.
The data that we will use is stored on the open machine learning platform openml.org, and we can download it, together with information on what to do with it, in the form of a task.
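Roughly, the workflow looks like this (the task id, learner choices and core count are placeholders):

```r
library(mlr)
library(OpenML)
library(parallelMap)

# Download a task (data plus resampling splits); the id is a placeholder
oml.task = getOMLTask(task.id = 3954)
obj = convertOMLTaskToMlr(oml.task)

lrns = list(makeLearner("classif.rpart"), makeLearner("classif.randomForest"))

parallelStartSocket(4)  # parallelize the resampling iterations
res = benchmark(lrns, tasks = obj$mlr.task, resamplings = obj$mlr.rin)
parallelStop()
```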
We are happy to finally announce the first release of mlrMBO on CRAN after quite a long development time. For the theoretical background and a nearly complete overview of mlrMBO's capabilities, you can check our paper on mlrMBO, which we submitted to arXiv.
The key features of mlrMBO are:
For examples covering different scenarios we have vignettes, which are also available as online documentation. For mlr users, mlrMBO is especially interesting for hyperparameter optimization.
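For instance, a model-based tune of an SVM cost parameter might look roughly like this (the learner, parameter range and budget are assumptions):

```r
library(mlr)
library(mlrMBO)

# Tune the cost parameter of an SVM on a log scale via model-based optimization
ps = makeParamSet(
  makeNumericParam("C", lower = -5, upper = 5, trafo = function(x) 2^x)
)
ctrl = makeTuneControlMBO(budget = 20)
res = tuneParams("classif.ksvm", iris.task, cv3, par.set = ps, control = ctrl)
```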
Achieving a good score on a Kaggle competition is typically quite difficult.
This blog post outlines 7 tips for beginners to improve their ranking on the Kaggle leaderboards.
For this purpose, I also created a Kernel for the Kaggle bike sharing competition that shows how the R package mlr can be used to tune an xgboost model with random search in parallel (using 16 cores). The R script scores rank 90 (of 3251) on the Kaggle leaderboard.
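The core of the approach, as a hedged sketch (the parameter ranges are assumptions, and the built-in bh.task stands in for the competition data):

```r
library(mlr)
library(parallelMap)

# One-hot encode factor features, since xgboost handles only numerics
lrn = makeDummyFeaturesWrapper(makeLearner("regr.xgboost", nrounds = 200))

ps = makeParamSet(
  makeNumericParam("eta", lower = 0.01, upper = 0.3),
  makeIntegerParam("max_depth", lower = 3L, upper = 10L),
  makeNumericParam("subsample", lower = 0.5, upper = 1)
)
ctrl = makeTuneControlRandom(maxit = 100L)

parallelStartSocket(16)  # the kernel used 16 cores
res = tuneParams(lrn, bh.task, cv5, par.set = ps, control = ctrl)
parallelStop()
```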
Conducting research openly and reproducibly is becoming the gold standard in academic research. Practicing open and reproducible research, however, is hard. OpenML.org (Open Machine Learning) is an online platform that aims to make the parts of research involving data and analyses easier. It automatically connects data sets, research tasks, algorithms, analyses and results, and allows users to access all components, including meta information, through a REST API in a machine-readable, standardized format. Everyone can see, work with and expand other people's work in a fully reproducible way.
At useR!2017 we will present an R package to interface the OpenML platform and illustrate its usage both as a stand-alone package and in combination with the mlr machine learning package. Furthermore, we show how the OpenML package allows R users to easily search, download and upload machine learning datasets.
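In a nutshell, the search-and-download part can look like this (data set id 61 should be the iris data on OpenML, but treat it as a placeholder):

```r
library(OpenML)

# Search: list data sets together with their meta information
ds.overview = listOMLDataSets(limit = 10)

# Download a single data set and convert it into an mlr task
ds = getOMLDataSet(data.id = 61)
task = convertOMLDataSetToMlr(ds)
```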
In 2017 we are hosting the workshop at LMU Munich. The workshop will run from 6 March to 10 March 2017 (potentially including the Sunday before and the Saturday after), hosted by the Ludwig-Maximilians-University Munich.
It is also possible to arrive on Saturday or Sunday, as we already have the rooms and are able to work there. But this is entirely optional, and the official workshop starts on Monday. The same goes for the Saturday after the workshop.
We are happy to announce that we applied for another Google Summer of Code project in 2017.
mlr 2.10 is now on CRAN. Please update your package if you haven’t done so in a while.
Here is an overview of the changes:
We are happy to announce that we can finally answer the question on how to cite mlr properly in publications.
Our paper on mlr has been published in the open-access Journal of Machine Learning Research (JMLR) and can be downloaded from the journal home page.
OpenML stands for Open Machine Learning and is an online platform that aims to support collaborative machine learning online. It is an Open Science project that allows its users to share data, code and machine learning experiments.
At the time of writing this blog post I am in Eindhoven at an OpenML workshop, where developers and scientists meet to work on improving the project. Some of these people are R users, and they (we) are developing an R package that communicates with the OpenML platform.
I recently participated in the #TEFDataChallenge, a datathon organized by Wayra. The first prize was a drone for every team member, which is a pretty awesome prize.
So what exactly is a datathon?
Learners use hyperparameters to achieve better performance on particular datasets. When we use a machine learning package to choose the best hyperparameters, the relationship between changing a hyperparameter and performance might not be obvious. mlr provides several new implementations to better understand what happens when we tune hyperparameters and to help us optimize our choice of hyperparameters.
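One of these implementations collects the tuning trace so that it can be plotted. A minimal sketch (the learner and parameter grid are assumptions):

```r
library(mlr)

ps = makeParamSet(makeDiscreteParam("C", values = 2^(-4:4)))
ctrl = makeTuneControlGrid()
res = tuneParams("classif.ksvm", pid.task, cv3, par.set = ps, control = ctrl)

# Collect the tuning trace and plot performance against the hyperparameter
data = generateHyperParsEffectData(res)
plotHyperParsEffect(data, x = "C", y = "mmce.test.mean", plot.type = "line")
```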
The mlr developer team is quite international: Germany, USA, Canada. The time difference between these countries sometimes makes it hard to communicate and develop new features.
The idea behind this workshop, or sprint, was to create an opportunity to talk about the project's status, future and structure, to exterminate stubborn bugs, and to start developing some fancy new features.
Learners use features to make predictions, but how those features are used is often not apparent.
mlr can estimate the dependence of a learned function on a subset of the feature space using generatePartialDependenceData.
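A minimal sketch on the Boston Housing task that ships with mlr (the learner choice and feature are assumptions):

```r
library(mlr)

mod = train(makeLearner("regr.randomForest"), bh.task)

# Marginal dependence of the prediction on the "lstat" feature
pd = generatePartialDependenceData(mod, bh.task, features = "lstat")
plotPartialDependence(pd)
```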
There are already some benchmarking studies about different classification algorithms out there. Probably the most well known and most extensive one is the Do We Need Hundreds of Classifiers to Solve Real World Classification Problems? paper. The authors use different software and different tuning processes to compare 179 learners on 121 datasets, mainly from the UCI repository. They exclude some datasets because their dimension (number of observations or features) is too high, because they are not in a proper format, or for other reasons, and there has also been some criticism regarding the representativeness of the datasets and the generalizability of the benchmarking results. It remains a bit unclear whether their tuning process was carried out on the test data or only on the training data (page 3154). They report the random forest algorithms to be the best ones (in general) for multiclass classification datasets and the support vector machine (SVM) the second best. On binary classification tasks neural networks also perform competitively. They recommend the R library caret for choosing a classifier.
In this post I want to briefly introduce you to the great visualization possibilities of mlr. Within the last months a lot of work has been put into that field. This post is not a tutorial but rather a demonstration of how little code you have to write with mlr to get some nice plots showing the prediction behavior of different learners.
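To give you an idea, a single call like the following (the learner and the feature pair are arbitrary choices) already produces a decision-boundary plot:

```r
library(mlr)

# Trains the learner on two features and plots its predictions on a 2-d grid
plotLearnerPrediction("classif.rpart", iris.task,
  features = c("Petal.Length", "Petal.Width"))
```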