Fortunately, there are many methods that can make machine learning models interpretable.
The R package iml
provides tools for analysing any black box machine learning model:
- Feature importance: Which were the most important features?
- Feature effects: How does a feature influence the prediction? (Partial dependence plots and individual conditional expectation curves)
- Explanations for single predictions: How did the feature values of a single data point affect its prediction? (LIME and Shapley values)
- Surrogate trees: Can we approximate the underlying black box model with a short decision tree?
The iml package works for any classification and regression machine learning model: random forests, linear models, neural networks, xgboost, etc.
This blog post shows you how to use the iml
package to analyse machine learning models.
While the mlr
package makes it super easy to train machine learning models, the iml
package makes it easy to extract insights about the learned black box machine learning models.
If you want to learn more about the technical details of all the methods, read the Interpretable Machine Learning book.
Let’s explore the iml
-toolbox for interpreting an mlr
machine learning model with concrete examples!
We’ll use the MASS::Boston
dataset to demonstrate the abilities of the iml package. This dataset contains median house values from Boston neighbourhoods.
First we train a randomForest to predict the Boston median housing value:
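A minimal sketch of this step (the hyperparameters, e.g. `ntree = 50`, are illustrative and not necessarily the post's exact settings):

```r
library("mlr")
data("Boston", package = "MASS")

# Create a regression task for the median house value and train a random forest
boston.task <- makeRegrTask(data = Boston, target = "medv")
rf <- train(makeLearner("regr.randomForest", ntree = 50), boston.task)
```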
We create a Predictor object that holds the model and the data. The iml package uses R6 classes: new objects can be created by calling Predictor$new(). Predictor works best with mlr models (of class WrappedModel), but it is also possible to use models from other packages.
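A sketch of this step, assuming the random forest `rf` and the `Boston` data from the training step above:

```r
library("iml")

# Separate the features from the target; the Predictor holds model and data
X <- Boston[which(names(Boston) != "medv")]
predictor <- Predictor$new(rf, data = X, y = Boston$medv)
```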
We can measure how important each feature was for the predictions with FeatureImp
. The feature importance measure works by shuffling each feature and measuring how much the performance drops. For this regression task we choose to measure the loss in performance with the mean absolute error (‘mae’); another choice would be the mean squared error (‘mse’).
Once we have created a new FeatureImp object, the importance is automatically computed.
We can call the plot()
function of the object or look at the results in a data.frame.
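For example (assuming the `predictor` object from above):

```r
# Importance is computed on construction; 'mae' is the chosen loss
imp <- FeatureImp$new(predictor, loss = "mae")
plot(imp)    # plot the importance values
imp$results  # the same information as a data.frame
```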
Besides learning which features were important, we are interested in how the features influence the predicted outcome. The Partial
class implements partial dependence plots and individual conditional expectation curves. Each individual line represents the predictions (y-axis) for one data point when we change one of the features (e.g. ‘lstat’ on the x-axis). The highlighted line is the point-wise average of the individual lines and equals the partial dependence plot. The marks on the x-axis indicate the distribution of the ‘lstat’ feature, showing how relevant a region is for interpretation (few or no points mean that we should not over-interpret that region).
If we want to compute the partial dependence curves for another feature, we can simply reset the feature. Also, we can center the curves at a feature value of our choice, which makes it easier to see the trend of the curves:
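A sketch using the `Partial` class as in this post (note that newer iml versions have superseded it with `FeatureEffect`); `predictor` is the object created above:

```r
# ICE curves and partial dependence for 'lstat'
pdp <- Partial$new(predictor, feature = "lstat")
plot(pdp)

# Reset to another feature and center the curves at its minimum
pdp$set.feature("rm")
pdp$center(min(Boston$rm))
plot(pdp)
```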
Another way to make the models more interpretable is to replace the black box with a simpler model - a decision tree. We take the predictions of the black box model (in our case the random forest) and train a decision tree on the original features and the predicted outcome. The plot shows the terminal nodes of the fitted tree. The maxdepth parameter controls how deep the tree can grow and therefore how interpretable it is.
We can use the tree to make predictions:
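A sketch of the surrogate tree (`maxdepth = 2` is illustrative):

```r
# Fit a shallow decision tree to the random forest's predictions
tree <- TreeSurrogate$new(predictor, maxdepth = 2)
plot(tree)

# The surrogate tree can be used for predictions itself
head(tree$predict(Boston))
```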
A global surrogate model can improve our understanding of the global behaviour of the black box model.
We can also fit a model locally to understand an individual prediction better. The local model fitted by LocalModel is a linear regression model, and the data points are weighted by how close they are to the data point for which we want to explain the prediction.
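For instance, to explain the first data point (assuming `predictor` and `X` from above):

```r
# Explain one prediction with a locally weighted linear model
lime.explain <- LocalModel$new(predictor, x.interest = X[1, ])
lime.explain$results
plot(lime.explain)
```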
An alternative for explaining individual predictions is a method from coalitional game theory named Shapley value. Assume that for one data point, the feature values play a game together, in which they get the prediction as a payout. The Shapley value tells us how to fairly distribute the payout among the feature values.
We can reuse the object to explain other data points:
The results in data.frame form can be extracted like this:
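A sketch covering all three steps (assuming `predictor` and `X` from above):

```r
# Fairly attribute the prediction to the feature values
shapley <- Shapley$new(predictor, x.interest = X[1, ])
plot(shapley)

# Reuse the object to explain another data point
shapley$explain(x.interest = X[2, ])

# Extract the results as a data.frame
results <- shapley$results
head(results)
```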
The mlr (Machine Learning in R) package provides a generic, object-oriented and extensible framework for classification, regression, survival analysis and clustering for the statistical programming language R.
The package targets practitioners who want to quickly apply machine learning algorithms, as well as researchers who want to implement, benchmark, and compare their new methods in a structured environment.
We are happy to announce that we now offer training courses specialized on mlr
:
The Munich R Courses already offer the three-day course “Machine Learning and Data Mining in R”.
The course includes a basic introduction to theoretical concepts of machine learning and especially focuses on the mlr
package.
The next course starts on May 2nd, 2018 and will be held in German (see here).
The Professional Certificate Program “Data Science” also includes a short introduction to the mlr
package. It is a brand new extra-occupational 10-day training at the University of Munich (LMU).
The certificate program starts in May 2018 and the application deadline is March 15th, 2018.
You can request an inhouse R course in English or German via our contact form.
If you offer R or Data Science courses and want to include a course on mlr
: Machine Learning in R into your course program, feel free to contact us at rkurse@stat.uni-muenchen.de.
In the following we will demonstrate this feature on a simple example.
First we need an objective function we want to optimize. For this post a simple function will suffice, but note that this could also be an external process: in human-in-the-loop mode mlrMBO does not need access to the objective function, as you only pass its results to mlrMBO.
However, we still need to define our search space. In this case we look for a real-valued parameter between -3 and 3. For more hints on how to define ParamSets you can look here or in the help of ParamHelpers.
We also need some initial evaluations to start the optimization.
The design has to be passed as a data.frame
with one column for each dimension of the search space and one column y
for the outcomes of the objective function.
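A sketch of these preparations (the objective function is purely illustrative):

```r
library("mlrMBO")
set.seed(1)

# A toy objective; in human-in-the-loop mode this could be an external process
fun <- function(x) x^2 + sin(3 * x)

# Search space: one real-valued parameter between -3 and 3
ps <- makeParamSet(makeNumericParam("x", lower = -3, upper = 3))

# Initial design: one column per dimension of the search space plus a column y
design <- generateDesign(n = 5, par.set = ps)
design$y <- sapply(design$x, fun)
```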
With these values we can initialize our sequential MBO object.
The opt.state
now contains all necessary information for the optimization.
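Initializing the SMBO object might look like this (a sketch using expected improvement as infill criterion, matching the plot described below):

```r
# Control object with expected improvement as the infill criterion
ctrl <- makeMBOControl()
ctrl <- setMBOControlInfill(ctrl, crit = crit.ei)

opt.state <- initSMBO(par.set = ps, design = design,
                      control = ctrl, minimize = TRUE)
plot(opt.state)
```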
We can even plot it to see how the Gaussian process models the objective function.
In the first panel the expected improvement ($EI = E(y_{min}-\hat{y})$) (see Jones et al.) is plotted over the search space. The maximum of the EI indicates the point that we should evaluate next. The second panel shows the mean prediction of the surrogate model, which in this example is a Gaussian process regression model (aka Kriging). The third panel shows the uncertainty prediction of the surrogate. We can see that the EI is high at points where the mean prediction is low and/or the uncertainty is high.
To obtain the specific configuration suggested by mlrMBO for the next evaluation of the objective we can run:
We will execute our objective function with the suggested value for x
and feed it back to mlrMBO:
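With the toy objective from above, the propose/evaluate/update cycle might look like:

```r
# Ask mlrMBO for the next point to evaluate ...
prop <- proposePoints(opt.state)
x <- prop$prop.points$x

# ... evaluate it ourselves and feed the result back
y <- fun(x)
updateSMBO(opt.state, x = data.frame(x = x), y = y)
```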
The nice thing about the human-in-the-loop mode is that you don’t have to stick to the suggestion. In other words, we can feed the model with values without receiving a proposal. Let’s assume we have an expert who tells us to evaluate the values $x=-1$ and $x=1$; we can easily do so:
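Feeding in the expert's points (again using the toy objective `fun` as a stand-in):

```r
# Evaluate the expert-suggested points and update the surrogate directly
for (x in c(-1, 1)) {
  updateSMBO(opt.state, x = data.frame(x = x), y = fun(x))
}
```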
We can also automate the process easily:
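For example, with a simple loop (a sketch; the regular mlrMBO loop adds logging and termination on top of this):

```r
# Automate the propose/evaluate/update cycle for a few iterations
for (i in 1:5) {
  prop <- proposePoints(opt.state)
  x <- prop$prop.points$x
  updateSMBO(opt.state, x = data.frame(x = x), y = fun(x))
}
```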
Note: We suggest using the normal mlrMBO mode if this is all you are doing, as it offers more advanced logging, termination criteria and error handling.
Let’s see how the surrogate models the true objective function after having seen seven configurations:
You can convert the opt.state
object from this run to a normal mlrMBO result object like this:
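For example:

```r
res <- finalizeSMBO(opt.state)
res$x  # best configuration found
res$y  # corresponding objective value
```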
Note: You can always run the human-in-the-loop MBO on res$final.opt.state
.
For the curious, let’s see what our original function actually looks like and which points we evaluated during our optimization:
We can see that we got pretty close to the global optimum and that the surrogate in the previous plot models the objective quite accurately.
For more in-depth information look at the Vignette for Human-in-the-loop MBO and check out the other topics of our mlrMBO page.
Their way to success was the implementation of an interactive web app that could serve as a decision support system for underwriting and tariffing units at Munich Re that deal with natural catastrophes. Munich Re provided the participants with databases on claims exposure in Florida Bay, footprints of past hurricanes and tons of data on climate variable measurements over the past decades. One of the core tasks of the challenge was to define and calculate the maximum foreseeable loss and the probability of such a worst-case event taking place.
To answer the first question, they created a web app that calculates the expected loss of a hurricane in a certain region. To give the decision makers the opportunity to include their expert domain knowledge, they could interact with the app and set the shape and location of the hurricane, which was modelled as a spatial Gaussian process. This is depicted in the first screenshot. Due to an NDA, the true figures and descriptions in the app were altered.
The team recognised the existence of several critical nuclear power plants in this area. The shocking event of Fukushima in 2011 showed the disastrous effects that storm surges, a side effect of hurricanes, can have in combination with nuclear power plants. To account for this, team Rtus implemented the “Nuclear Power Plant” mode in the app. The function of this NPP-mode is shown in this figure:
In a next step, the team tried to provide evidence for the plausibility of such a worst-case event. The following image, based on the footprints of past hurricanes, shows that there were indeed hurricanes crossing the locations of the nuclear power plants:
To answer the second part of the question, they also created a simulation of several weather variables to forecast the probability of such heavy category 5 hurricane events. One rule of reliable statistical modelling is the inclusion of uncertainty measures in any prediction, which was integrated via prediction intervals. Also, the user of the simulation is able to increase or decrease the estimated temperature trend that underlies the model. This screenshot illustrates the simulation app:
The 36 hours of intensive hacking, discussions, reiterations and fantastic teamwork, combined with the consumption of an estimated 19.68 litres of Club Mate, were finally rewarded with first place and a Microsoft Surface Pro for each of the knights. “We will apply this augmentation of our weapon arsenal directly in the next data battle”, one of the knights proudly stated during the awards ceremony.
This portrait shows the tired but happy data knights (from left to right: Niklas Klein, Moritz Herrmann, Jann Goschenhofer, Markus Dumke and Daniel Schalk):
The field of Machine Learning has grown tremendously in recent years and is a key component of data-driven science. Data analysis algorithms are invented and used every day, but their results and experiments are published almost exclusively in journals or separate repositories. However, data by itself has no value: it’s the ever-changing ecosystem surrounding data that gives it meaning.
OpenML is a networked science platform that aims to connect and organize all this knowledge online, linking data, algorithms, results and people into a coherent whole so that scientists and practitioners can easily build on prior work and collaborate in real time online.
OpenML has an online interface on openml.org, and is integrated in the most popular machine learning tools and statistical environments such as R, Python, WEKA, MOA and RapidMiner. This allows researchers and students to easily import and export data from these tools and share them with others online, fully integrated into the context of the state of the art. On OpenML, researchers can connect to each other, start projects, and build on the results of others. It automatically keeps track of how often shared work is reused so that researchers can follow the wider impact of their work and become more visible.
The OpenML workshop is organized as a hackathon, an event where participants from many scientific domains present their goals and ideas, and then work on them in small teams for many hours or days at a time. Participants bring their laptops, learn how to use OpenML in tutorials, and build upon that to create something great to push their research forward. The complete OpenML development team will be available to get them started, answer questions, and implement new features on the fly.
The next OpenML Workshop will run from 9 October to 13 October 2017; see also here.
Also, the search space is easily definable and customizable for each of the 60+ learners of mlr, using the ParamSets from the ParamHelpers package.
The only drawback of mlr in comparison to caret in this regard is that mlr itself does not ship defaults for the search spaces. This is where mlrHyperopt comes into play.
mlrHyperopt offers
Tuning can be done in one line relying on the defaults. The default will automatically minimize the misclassification rate.
We can find out what hyperopt
did by inspecting the res
object.
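A minimal example on mlr's built-in `iris.task` (the SVM learner is illustrative):

```r
library("mlrHyperopt")

# One-line tuning relying on the default search spaces and procedures
res <- hyperopt(iris.task, learner = "classif.svm")
res  # inspect the tuned hyperparameters and the resampled performance
```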
Depending on the parameter space mlrHyperopt will automatically decide for a suitable tuning method:
As the search space defined in the ParamSet is only numeric, sequential Bayesian optimization was chosen. We can look into the evaluated parameter configurations and we can visualize the optimization run.
The upper left plot shows the distribution of the tried settings in the search space, and contour lines indicate where regions of good configurations are located. The lower right plot shows the value of the objective (the misclassification rate) and how it decreases over time. This also shows nicely that wrong settings can lead to bad results.
Often you don’t want to rely on the default procedures of mlrHyperopt and instead just incorporate it into your mlr workflow. Here is one example of how you can use the default search spaces for an easy benchmark:
As we can see, we were able to improve the performance of xgboost and nnet without any additional knowledge of which parameters we should tune. The improvement is especially noticeable for nnet.
Some recommended additional reads
Stefan and I started working on this project in late summer 2016 as part of a practical course we attended for our Master’s program. We enjoyed the work on this project and will continue to maintain and extend our app in the future. After almost one year of work, our application has become a versatile tool and it is time to present it to a broader audience. To introduce you to the workflow and main features of our app, we uploaded a video series to our YouTube channel. The videos are little tutorials that illustrate the workflow in the form of a use case: we used the Titanic dataset from the Kaggle competition as example data to show you step by step how it can be analyzed with our application.
The first video gives a small introduction and shows you how data can be imported:
In the next tutorial you will learn how to visualise your data and preprocess it:
The third and fourth screencasts show you how to create your task and how to construct and modify our built-in learning algorithms:
The fifth part of our tutorials shows you how to tune your learners to find suitable parameter settings for your given training set:
The sixth video gives you detailed information on how to actually train models on your task, predict on new data and plot model diagnostic and prediction plots:
The seventh video runs a benchmark experiment, to show you how to compare different learners in our application:
The last tutorial briefly demonstrates how to render an interactive report from your analysis done with our app:
I hope you enjoyed watching the videos and learned how to make use of our application. If you like working with our app, please leave us a star and follow us on GitHub.
First we need to install the cranlogs package using devtools:
Now let’s load all the packages we will need:
To obtain a neat table of all available learners in mlr, we can call listLearners().
This table also contains a column with the packages needed for each learner, separated by commas.
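For example:

```r
library("mlr")

# List all learners with their required packages
lrns <- listLearners()
head(lrns[, c("class", "name", "package")])
```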
Note: You might get some warnings here because you likely did not install all packages that mlr suggests – which is totally fine.
Now we can obtain the download counts from the RStudio CRAN mirror for the last month.
We use data.table
to easily sum up the download counts of each day.
As some learners need multiple packages we will use the download count of the package with the least downloads.
Let’s put these numbers in our table:
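A sketch of this aggregation (column names follow cranlogs' output; details may differ from the post's exact code):

```r
library("mlr")
library("cranlogs")
library("data.table")

lrns <- listLearners()

# Download counts for every package needed by at least one learner
pkgs <- unique(unlist(strsplit(lrns$package, ",")))
downloads <- data.table(cran_downloads(packages = pkgs, when = "last-month"))

# Sum the daily counts per package
dl.counts <- downloads[, .(downloads = sum(count)), by = package]

# For learners needing several packages, use the least-downloaded one
lrns$downloads <- sapply(strsplit(lrns$package, ","),
  function(p) min(dl.counts[package %in% p, downloads]))
```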
Here are the first 5 rows of the table:
class | name | package | downloads |
---|---|---|---|
surv.coxph | Cox Proportional Hazard Model | survival | 153681 |
classif.naiveBayes | Naive Bayes | e1071 | 102249 |
classif.svm | Support Vector Machines (libsvm) | e1071 | 102249 |
regr.svm | Support Vector Machines (libsvm) | e1071 | 102249 |
classif.lda | Linear Discriminant Analysis | MASS | 55852 |
Now let’s get rid of the duplicates introduced by the distinction between the types classif, regr etc., and we already have our…
The top 20 according to the RStudio CRAN mirror:
class | name | package | downloads |
---|---|---|---|
surv.coxph | Cox Proportional Hazard Model | survival | 153681 |
classif.naiveBayes | Naive Bayes | e1071 | 102249 |
classif.svm | Support Vector Machines (libsvm) | e1071 | 102249 |
classif.lda | Linear Discriminant Analysis | MASS | 55852 |
classif.qda | Quadratic Discriminant Analysis | MASS | 55852 |
classif.randomForest | Random Forest | randomForest | 52094 |
classif.gausspr | Gaussian Processes | kernlab | 44812 |
classif.ksvm | Support Vector Machines | kernlab | 44812 |
classif.lssvm | Least Squares Support Vector Machine | kernlab | 44812 |
cluster.kkmeans | Kernel K-Means | kernlab | 44812 |
regr.rvm | Relevance Vector Machine | kernlab | 44812 |
classif.cvglmnet | GLM with Lasso or Elasticnet Regularization (Cross Validated Lambda) | glmnet | 41179 |
classif.glmnet | GLM with Lasso or Elasticnet Regularization | glmnet | 41179 |
surv.cvglmnet | GLM with Regularization (Cross Validated Lambda) | glmnet | 41179 |
surv.glmnet | GLM with Regularization | glmnet | 41179 |
classif.cforest | Random forest based on conditional inference trees | party | 36492 |
classif.ctree | Conditional Inference Trees | party | 36492 |
regr.cforest | Random Forest Based on Conditional Inference Trees | party | 36492 |
regr.mob | Model-based Recursive Partitioning Yielding a Tree with Fitted Models Associated with each Terminal Node | party,modeltools | 36492 |
surv.cforest | Random Forest based on Conditional Inference Trees | party,survival | 36492 |
As we are just looking for the packages let’s compress the table a bit further and come to our…
Here are the first 20 rows of the table:
package | downloads | learners |
---|---|---|
survival | 153681 | surv.coxph |
e1071 | 102249 | classif.naiveBayes,classif.svm,regr.svm |
MASS | 55852 | classif.lda,classif.qda |
randomForest | 52094 | classif.randomForest,regr.randomForest |
kernlab | 44812 | classif.gausspr,classif.ksvm,classif.lssvm,cluster.kkmeans,regr.gausspr,regr.ksvm,regr.rvm |
glmnet | 41179 | classif.cvglmnet,classif.glmnet,regr.cvglmnet,regr.glmnet,surv.cvglmnet,surv.glmnet |
party | 36492 | classif.cforest,classif.ctree,multilabel.cforest,regr.cforest,regr.ctree |
party,modeltools | 36492 | regr.mob |
party,survival | 36492 | surv.cforest |
fpc | 33664 | cluster.dbscan |
rpart | 28609 | classif.rpart,regr.rpart,surv.rpart |
RWeka | 20583 | classif.IBk,classif.J48,classif.JRip,classif.OneR,classif.PART,cluster.Cobweb,cluster.EM,cluster.FarthestFirst,cluster.SimpleKMeans,cluster.XMeans,regr.IBk |
gbm | 19554 | classif.gbm,regr.gbm,surv.gbm |
nnet | 19538 | classif.multinom,classif.nnet,regr.nnet |
caret,pls | 18106 | classif.plsdaCaret |
pls | 18106 | regr.pcr,regr.plsr |
FNN | 16107 | classif.fnn,regr.fnn |
earth | 15824 | regr.earth |
neuralnet | 15506 | classif.neuralnet |
class | 14493 | classif.knn,classif.lvq1 |
And of course we want to have a small visualization:
This is not really representative of how popular each learner is, as some packages serve multiple purposes (e.g. provide multiple learners). Furthermore, it would be great to have access to a trending list. Also, the number of stars on GitHub gives a better view of what developers are interested in; looking for machine learning packages there, we see e.g. xgboost, h2o and tensorflow.
First, let me introduce you to multilabel classification. This is a classification problem where every instance can have more than one label. Let’s have a look at a typical multilabel dataset (which I, of course, download from the OpenML server):
Here I took the scene dataset, where the features represent color information of pictures and the targets could be objects like beach, sunset, and so on.
As you can see above, one defining property of a multilabel dataset is that the target variables (which are called labels) are binary. If you want to use your own dataset, make sure to encode these variables as logical, where TRUE indicates the relevance of a label.
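One way to fetch the dataset is via the OpenML package (a sketch; the dataset name and label column names are assumptions and may need adjusting):

```r
library("OpenML")

# Fetch the scene dataset by name (name assumed; could also use data.id)
scene <- getOMLDataSet(data.name = "scene")$data

# Ensure the label columns are logical (TRUE = label is relevant)
labels <- c("Beach", "Sunset", "FallFoliage", "Field", "Mountain", "Urban")
scene[labels] <- lapply(scene[labels],
  function(x) as.logical(as.integer(as.character(x))))
```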
The basic idea behind many multilabel classification algorithms is to make use of possible correlation between labels. Maybe a learner is very good at predicting label 1, but rather bad at predicting label 2. If label 1 and label 2 are highly correlated, it may be beneficial to predict label 1 first and use this prediction as a feature for predicting label 2.
This approach is the main concept behind the so-called problem transformation methods: the multilabel problem is transformed into binary classification problems, one for each label, and predicted labels are used as features for predicting other labels.
We implemented the following problem transformation methods:
How these methods are defined can be read in the mlr tutorial or, in more detail, in our paper. Enough theory, let’s apply these methods to our dataset.
First we need to create a multilabel task.
We set a seed because the classifier chains wrapper uses a random chain order. Next, we train a learner. I chose the classifier chains approach together with a decision tree for the binary classification problems.
Now let’s train and predict on our dataset:
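Putting the steps together (the label names are assumptions about the scene data):

```r
library("mlr")

# Create the multilabel task from the logical label columns
labels <- c("Beach", "Sunset", "FallFoliage", "Field", "Mountain", "Urban")
scene.task <- makeMultilabelTask(data = scene, target = labels)

# The chain order is random -> set a seed for reproducibility
set.seed(1729)

# Classifier chains with a decision tree as binary base learner
binary.learner <- makeLearner("classif.rpart")
lrncc <- makeMultilabelClassifierChainsWrapper(binary.learner)

mod <- train(lrncc, scene.task)
pred <- predict(mod, task = scene.task)
```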
We also implemented common multilabel performance measures. Here is a list with available multilabel performance measures:
Here is how the classifier chains method performed:
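For example, evaluating a few of the measures listed above (assuming the prediction `pred` from the previous step):

```r
performance(pred, measures = list(multilabel.hamloss, multilabel.subset01,
                                  multilabel.f1, multilabel.acc))
```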
Now let’s see if it can be beneficial to use predicted labels as features for other labels, by comparing the performance of the classifier chains method with the binary relevance method (which does not use predicted labels as features).
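Assuming the task and base learner from above, the comparison might look like:

```r
# Binary relevance: one independent binary problem per label
lrnbr <- makeMultilabelBinaryRelevanceWrapper(binary.learner)
mod.br <- train(lrnbr, scene.task)
pred.br <- predict(mod.br, task = scene.task)

performance(pred.br, measures = list(multilabel.hamloss, multilabel.subset01,
                                     multilabel.f1, multilabel.acc))
```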
As can be seen here, it can indeed make sense to use more elaborate methods for multilabel classification, since classifier chains beat the binary relevance method on all of these measures (note that hamming loss and subset01 are loss measures!).
Here I’ll show you how to use resampling methods in the multilabel setting. Resampling methods are key for assessing the performance of a learning algorithm. To read more about resampling, see the page on our tutorial.
First, we need to define a resampling strategy. I chose subsampling, which is also called Monte Carlo cross-validation: the dataset is split into training and test set at a predefined ratio, the learner is trained on the training set, and its performance is evaluated on the test set. This whole process is repeated many times and the performance values are averaged. In mlr this is done the following way:
Now we can choose a measure, which shall be resampled. All there is left to do is to run the resampling:
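Assuming the classifier chains learner and task from above:

```r
# Subsampling: 10 random train/test splits
rdesc <- makeResampleDesc("Subsample", iters = 10)

# Resample the classifier chains learner, measuring hamming loss
r <- resample(lrncc, scene.task, rdesc, measures = list(multilabel.hamloss))
r
```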
If you followed the mlr tutorial or are already familiar with mlr, you most likely noticed that using resampling in the multilabel setting isn’t any different from using resampling in mlr generally. Many methods available in mlr, like preprocessing, tuning or benchmark experiments, can also be used for multilabel datasets, and the good thing is: the syntax stays the same!