Tired of learning to use multiple packages to access clustering algorithms?
Do different interfaces make it difficult to compare the performance of clusterers?
Wouldn't it be great to have just one package that makes interfacing with all things clustering easy?
mlr3cluster to the rescue!
mlr3cluster is a cluster analysis extension package within the mlr3 ecosystem. It is the successor of mlr's cluster capabilities in spirit and functionality.
In order to understand the following introduction and tutorial you need to be familiar with R6 and mlr3 basics. See chapters 1-2 of the mlr3book if you need a refresher.
To install the package, run the following code chunk:
install.packages("mlr3cluster")
Assuming you know all the basics and you've installed the package, here's an example of how to perform k-means clustering on the classic usarrests data set:
library(mlr3)
library(mlr3cluster)
task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$train(task)
preds = learner$predict(task = task)
preds
<PredictionClust> for 50 observations:
row_ids partition
1 2
2 2
3 2
---
48 1
49 1
50 1
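After training, the fitted model from the underlying package is accessible through the learner's `$model` field. For `clust.kmeans` this is the object returned by `stats::kmeans()`, so its components follow the stats package (a minimal sketch):

```r
library(mlr3)
library(mlr3cluster)

task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$train(task)

# the underlying stats::kmeans fit
learner$model$centers  # cluster centroids
learner$model$size     # number of observations per cluster
```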
What built-in clusterers does the package come with? Here is a list of integrated learners:
mlr_learners$keys("clust")
[1] "clust.agnes" "clust.ap" "clust.cmeans"
[4] "clust.cobweb" "clust.dbscan" "clust.diana"
[7] "clust.em" "clust.fanny" "clust.featureless"
[10] "clust.ff" "clust.hclust" "clust.kkmeans"
[13] "clust.kmeans" "clust.MBatchKMeans" "clust.meanshift"
[16] "clust.pam" "clust.SimpleKMeans" "clust.xmeans"
The library contains all the basic types of clusterers: partitional, hierarchical, density-based and fuzzy. Below is a detailed list of all the learners.
ID | Learner | Package |
---|---|---|
clust.agnes | Agglomerative Hierarchical Clustering | cluster |
clust.cmeans | Fuzzy C-Means Clustering | e1071 |
clust.dbscan | Density-based Clustering | dbscan |
clust.diana | Divisive Hierarchical Clustering | cluster |
clust.fanny | Fuzzy Clustering | cluster |
clust.featureless | Simple Featureless Clustering | mlr3cluster |
clust.kmeans | K-Means Clustering | stats |
clust.pam | Clustering Around Medoids | cluster |
clust.xmeans | K-Means with Automatic Determination of k | RWeka |
List of integrated cluster measures:
mlr_measures$keys("clust")
[1] "clust.ch" "clust.db" "clust.dunn"
[4] "clust.silhouette" "clust.wss"
Below is a detailed list of all the integrated measures.
ID | Measure | Package |
---|---|---|
clust.db | Davies-Bouldin Cluster Separation | clusterCrit |
clust.dunn | Dunn index | clusterCrit |
clust.ch | Calinski Harabasz Pseudo F-Statistic | clusterCrit |
clust.silhouette | Rousseeuw’s Silhouette Quality Index | clusterCrit |
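To evaluate predictions with one of these measures, pass it to the prediction's `$score()` method. Cluster measures need access to the feature data, so the task has to be supplied as well (a minimal sketch):

```r
library(mlr3)
library(mlr3cluster)

task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$train(task)
preds = learner$predict(task)

# silhouette index of the partition; higher is better
preds$score(msr("clust.silhouette"), task = task)
```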
There is only one built-in Task in the package:
mlr_tasks$get("usarrests")
<TaskClust:usarrests> (50 x 4)
* Target: -
* Properties: -
* Features (4):
- int (2): Assault, UrbanPop
- dbl (2): Murder, Rape
As you can see, the biggest difference between clustering tasks and the rest of the tasks in mlr3 is the absence of a Target column.
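You are not limited to the built-in task: any data.frame with numeric features can be wrapped into a clustering task via `TaskClust$new()` (a sketch using the iris data with its label column dropped):

```r
library(mlr3)
library(mlr3cluster)

# build a clustering task from the numeric iris features
task = TaskClust$new(id = "iris", backend = iris[, -5])
task$nrow           # number of observations
task$feature_names  # the four numeric columns
```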
Setting hyperparameters for clusterers is as easy as setting parameters for any other mlr3 learner:
task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$param_set
<ParamSet>
id class lower upper nlevels default value
1: centers ParamUty NA NA Inf 2 2
2: iter.max ParamInt 1 Inf Inf 10
3: algorithm ParamFct NA NA 4 Hartigan-Wong
4: nstart ParamInt 1 Inf Inf 1
5: trace ParamInt 0 Inf Inf 0
learner$param_set$values = list(centers = 3L, algorithm = "Lloyd", iter.max = 100L)
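Equivalently, hyperparameters can be set directly in the `lrn()` sugar constructor, which is the style the benchmarking code later in this post uses:

```r
library(mlr3)
library(mlr3cluster)

# set hyperparameters at construction time
learner = lrn("clust.kmeans", centers = 3L, algorithm = "Lloyd", iter.max = 100L)
learner$param_set$values
```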
The "train" method simply creates a model with cluster assignments for the data, while the "predict" method's functionality varies depending on the clusterer in question. Read each learner's documentation for details. For example, the kmeans learner's predict method uses clue::cl_predict, which assigns new observations to clusters by looking at their "closest" neighbors.
Following the example from the previous section:
task = mlr_tasks$get("usarrests")
train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)
learner = mlr_learners$get("clust.kmeans")
learner$train(task, row_ids = train_set)
preds = learner$predict(task, row_ids = test_set)
preds
<PredictionClust> for 10 observations:
row_ids partition
1 2
4 1
6 2
---
38 1
47 1
48 1
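Predictions can also be converted to a data.table, e.g. to join the cluster assignments back onto the original data (a minimal sketch):

```r
library(mlr3)
library(mlr3cluster)
library(data.table)

task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$train(task)
preds = learner$predict(task)

# one row per observation: row_ids and partition
dt = as.data.table(preds)
head(dt)
```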
To assess the quality of any machine learning experiment, you need to choose an evaluation metric that makes the most sense. Let's design an experiment that will allow you to compare the performance of three different clusterers on the same task. The mlr3 library provides benchmarking functionality that lets you create such experiments.
# design an experiment by specifying task(s), learner(s), resampling method(s)
design = benchmark_grid(
  tasks = tsk("usarrests"),
  learners = list(
    lrn("clust.kmeans", centers = 3L),
    lrn("clust.pam", k = 3L),
    lrn("clust.cmeans", centers = 3L)),
  resamplings = rsmp("holdout"))
print(design)
task learner resampling
1: <TaskClust[45]> <LearnerClustKMeans[37]> <ResamplingHoldout[19]>
2: <TaskClust[45]> <LearnerClustPAM[37]> <ResamplingHoldout[19]>
3: <TaskClust[45]> <LearnerClustCMeans[37]> <ResamplingHoldout[19]>
# execute benchmark
bmr = benchmark(design)
INFO [20:59:47.643] [mlr3] Running benchmark with 3 resampling iterations
INFO [20:59:47.757] [mlr3] Applying learner 'clust.pam' on task 'usarrests' (iter 1/1)
INFO [20:59:47.950] [mlr3] Applying learner 'clust.kmeans' on task 'usarrests' (iter 1/1)
INFO [20:59:47.965] [mlr3] Applying learner 'clust.cmeans' on task 'usarrests' (iter 1/1)
INFO [20:59:48.016] [mlr3] Finished benchmark
# aggregate the results with a cluster measure
bmr$aggregate(msr("clust.silhouette"))
nr resample_result task_id learner_id resampling_id iters
1: 1 <ResampleResult[22]> usarrests clust.kmeans holdout 1
2: 2 <ResampleResult[22]> usarrests clust.pam holdout 1
3: 3 <ResampleResult[22]> usarrests clust.cmeans holdout 1
clust.silhouette
1: 0.3638224
2: 0.5577741
3: 0.3638224
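The benchmark result can also be aggregated over several measures at once via `msrs()`; here Dunn index and silhouette, both from the measure table above (a minimal sketch):

```r
library(mlr3)
library(mlr3cluster)

design = benchmark_grid(
  tasks = tsk("usarrests"),
  learners = list(
    lrn("clust.kmeans", centers = 3L),
    lrn("clust.pam", k = 3L)),
  resamplings = rsmp("holdout"))
bmr = benchmark(design)

# score each resample result with several measures
bmr$aggregate(msrs(c("clust.dunn", "clust.silhouette")))
```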
How do you visualize clustering tasks and results? The mlr3viz package (version >= 0.4.0) now provides that functionality.
install.packages("mlr3viz")
library(mlr3viz)
task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$param_set$values = list(centers = 3L)
learner$train(task)
preds = learner$predict(task)
# Task visualization
autoplot(task)
# Pairs plot with cluster assignments
autoplot(preds, task)
# Silhouette plot with mean silhouette value as reference line
autoplot(preds, task, type = "sil")
# Performing PCA on task data and showing cluster assignments
autoplot(preds, task, type = "pca")
Keep in mind that mlr3viz::autoplot also provides more options depending on the kind of plots you're interested in. For example, to draw borders around clusters, provide appropriate parameters from ggfortify::autoplot.kmeans:
autoplot(preds, task, type = "pca", frame = TRUE)
You can also easily visualize dendrograms:
task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.agnes")
learner$train(task)
# Simple dendrogram
autoplot(learner)
If you have any issues with the package or would like to request a new feature, feel free to open an issue in the mlr3cluster GitHub repository.
I would like to thank the following people for their help and guidance: Michel Lang, Lars Kotthoff, Martin Binder, Patrick Schratz, Bernd Bischl.
For attribution, please cite this work as
Pulatov (2020, Aug. 26). mlr-org: Introducing mlr3cluster: Cluster Analysis Package. Retrieved from https://mlr-org.github.io/mlr-org-website/posts/2020-08-26-introducing-mlr3cluster-cluster-analysis-package/
BibTeX citation
@misc{pulatov2020introducing,
  author = {Pulatov, Damir},
  title = {mlr-org: Introducing mlr3cluster: Cluster Analysis Package},
  url = {https://mlr-org.github.io/mlr-org-website/posts/2020-08-26-introducing-mlr3cluster-cluster-analysis-package/},
  year = {2020}
}