Multilabel classification has lately gained growing interest in the research community. We implemented several methods, which make use of the standardized mlr framework. Every available binary learner can be used for multilabel problem transformation methods. So if you’re interested in using several multilabel algorithms and want to know how to use them in the mlr framework, then this post is for you!
We at mlr are currently deciding on a new logo, and in the spirit of open-source, we would like to involve the community in the voting process!
You can vote for your favorite logo on GitHub by reacting to the logo with a +1.
Thanks to Hannah Atkin for designing the logos!
Many people who want to apply Bayesian optimization want to use it to optimize an algorithm that is not implemented in R but runs on the command line as a shell script or an executable.
We recently published mlrMBO on CRAN. As a normal package it normally operates inside of R, but with this post I want to demonstrate how mlrMBO can be used to optimize an external application. At the same time I will highlight some issues you can likely run into.
With this post I want to show you how to benchmark several learners (or learners with different parameter settings) using several data sets in a structured and parallelized fashion.
For this we want to use
The data that we will use here is stored on the open machine learning platform openml.org and we can download it together with information on what to do with it in form of a task.
We are happy to finally announce the first release of mlrMBO on cran after a quite long development time. For the theoretical background and a nearly complete overview of mlrMBOs capabilities you can check our paper on mlrMBO that we presubmitted to arxiv.
The key features of mlrMBO are:
- Global optimization of expensive Black-Box functions.
- Multi-Criteria Optimization.
- Parallelization through multi-point proposals.
- Support for optimization over categorical variables using random forests as a surrogate.
For examples covering different scenarios we have Vignettes that are also available as an online documentation. For mlr users mlrMBO is especially interesting for hyperparameter optimization.
Achieving a good score on a Kaggle competition is typically quite difficult.
This blog post outlines 7 tips for beginners to improve their ranking on the Kaggle leaderboards.
For this purpose, I also created a Kernel
for the Kaggle bike sharing competition
that shows how the R package,
mlr, can be used to tune a xgboost model with random search in parallel (using 16 cores). The R script scores rank 90 (of 3251) on the Kaggle leaderboard.
- Use good software
- Understand the objective
- Create and select features
- Tune your model
- Validate your model
- Ensemble different models
- Track your progress
What is OpenML?
Conducting research openly and reproducibly is becoming the gold standard in academic research. Practicing open and reproducible research, however, is hard. OpenML.org (Open Machine Learning) is an online platform that aims at making the part of research involving data and analyses easier. It automatically connects data sets, research tasks, algorithms, analyses and results and allows users to access all components including meta information through a REST API in a machine readable and standardized format. Everyone can see, work with and expand other people’s work in a fully reproducible way.
The useR Tutorial
At useR!2017, we will we will present an R package to interface the OpenML platform and illustrate its usage both as a stand-alone package and in combination with the mlr machine learning package. Furthermore, we show how the OpenML package allows R users to easily search, download and upload machine learning datasets.
When and Where?
In 2017, we are hosting the workshop at LMU Munich. The workshop will run from 6 March to 10 March 2017 (potentially including the sunday before and the saturday at the end), hosted by the Ludwig-Maximilians-University Munich.
- Address: Geschwister-Scholl-Platz 1, Room: M203.
- Start: 6th of March: 10:00 AM.
It is also possible to arrive on Saturday or Sunday, as we already have the rooms and are able to work there. But this is totally optional and the official workshop starts on Monday. Same thing for the Saturday after the workshop.
We are happy to announce that we applied for a another Google Summer of Code project in 2017.
mlr 2.10 is now on CRAN. Please update your package if you haven’t done so in a while.
Here is an overview of the changes:
We are happy to announce that we can finally answer the question on how to cite mlr properly in publications.
Our paper on mlr has been published in the open-access Journal of Machine Learning Research (JMLR) and can be downloaded on the journal home page.
OpenML stands for Open Machine Learning and is an online platform, which aims at supporting collaborative machine learning online. It is an Open Science project that allows its users to share data, code and machine learning experiments.
At the time of writing this blog I am in Eindoven at an OpenML workshop, where developers and scientists meet to work on improving the project. Some of these people are R users and they (we) are developing an R package that communicates with the OpenML platform.
I recently participated in the #TEFDataChallenge a datathon organized by Wayra. The first price was a drone for every team member, which is a pretty awesome price.
So what exactly is a datathon?
Learners use hyperparameters to achieve better performance on particular datasets. When we use a machine learning package to choose the best hyperparmeters, the relationship between changing the hyperparameter and performance might not be obvious. mlr provides several new implementations to better understand what happens when we tune hyperparameters and to help us optimize our choice of hyperparameters.
The mlr developer team is quite international: Germany, USA, Canada. The time difference between these countries sometimes makes it hard to communicate and develop new features.
The idea for this workshop or sprint was to have the possibility to talk about the project status, future and structure, exterminate imperishable bugs and start developing some fancy features.
Learners use features to make predictions but how those features are used is often not apparent.
mlr can estimate the dependence of a learned function on a subset of the feature space using
There are already some benchmarking studies about different classification algorithms out there. The probably most well known and most extensive one is the Do we Need Hundreds of Classifers to Solve Real World Classication Problems? paper. They use different software and also different tuning processes to compare 179 learners on more than 121 datasets, mainly from the UCI site. They exclude different datasets, because their dimension (number of observations or number of features) are too high, they are not in a proper format or because of other reasons. There are also summarized some criticism about the representability of the datasets and the generability of benchmarking results. It remains a bit unclear if their tuning process is done also on the test data or only on the training data (page 3154). They reported the random forest algorithms to be the best one (in general) for multiclass classification datasets and the support vector machine (svm) the second best one. On binary class classification tasks neural networks also perform competitively. They recommend the R library caret for choosing a classifier.
In this post I want to shortly introduce you to the great visualization possibilities of
Within the last months a lot of work has been put into that field.
This post is not a tutorial but more a demonstration of how little code you have to write with
mlr to get some nice plots showing the prediction behaviors for different learners.