Introducing mlr3cluster: Cluster Analysis Package


Damir Pulatov https://github.com/damirpolat
2020-08-26

Tired of learning to use multiple packages to access clustering algorithms?

Does using different packages make it difficult to compare the performance of clusterers?

Wouldn’t it be great to have just one package that makes interfacing with all things clustering easy?

mlr3cluster to the rescue!

mlr3cluster is a cluster analysis extension package within the mlr3 ecosystem. It is the successor of mlr’s cluster capabilities in spirit and functionality.

To follow the introduction and tutorial below, you need to be familiar with R6 and mlr3 basics. See chapters 1-2 of the mlr3book if you need a refresher.

Installation

To install the package, run the following code chunk:

install.packages("mlr3cluster")
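
If you prefer the development version instead, it can be installed from GitHub; this sketch assumes you have the remotes package available:

# install the development version from GitHub (requires the remotes package)
remotes::install_github("mlr-org/mlr3cluster")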

Getting Started

Assuming you know all the basics and you’ve installed the package, here’s an example of how to perform k-means clustering on the classic USArrests data set:

library(mlr3)
library(mlr3cluster)

task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$train(task)
preds = learner$predict(task = task)

preds
<PredictionClust> for 50 observations:
    row_ids partition
          1         2
          2         2
          3         2
---                  
         48         1
         49         1
         50         1

Integrated Learners

What built-in clusterers does the package come with? Here is a list of integrated learners:

mlr_learners$keys("clust")
 [1] "clust.agnes"        "clust.ap"           "clust.cmeans"      
 [4] "clust.cobweb"       "clust.dbscan"       "clust.diana"       
 [7] "clust.em"           "clust.fanny"        "clust.featureless" 
[10] "clust.ff"           "clust.hclust"       "clust.kkmeans"     
[13] "clust.kmeans"       "clust.MBatchKMeans" "clust.meanshift"   
[16] "clust.pam"          "clust.SimpleKMeans" "clust.xmeans"      

The library contains all the basic types of clusterers: partitional, hierarchical, density-based and fuzzy. Below is a more detailed look at some of these learners and the packages they interface.

ID                 Learner                                     Package
clust.agnes        Agglomerative Hierarchical Clustering       cluster
clust.cmeans       Fuzzy C-Means Clustering                    e1071
clust.dbscan       Density-based Clustering                    dbscan
clust.diana        Divisive Hierarchical Clustering            cluster
clust.fanny        Fuzzy Clustering                            cluster
clust.featureless  Simple Featureless Clustering               mlr3cluster
clust.kmeans       K-Means Clustering                          stats
clust.pam          Clustering Around Medoids                   cluster
clust.xmeans       K-Means with Automatic Determination of k   RWeka
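
Any of these learners can be constructed with mlr3’s lrn() shorthand. As a quick sketch (picking clust.pam as an arbitrary example), you can check which package a learner wraps and which hyperparameters it exposes:

library(mlr3)
library(mlr3cluster)

# construct a learner and inspect the package it wraps and its hyperparameters
learner = lrn("clust.pam")
learner$packages
learner$param_set$ids()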

Integrated Measures

List of integrated cluster measures:

mlr_measures$keys("clust")
[1] "clust.ch"         "clust.db"         "clust.dunn"      
[4] "clust.silhouette" "clust.wss"       

Below is a more detailed list of the integrated measures.

ID                Measure                                Package
clust.db          Davies-Bouldin Cluster Separation      clusterCrit
clust.dunn        Dunn Index                             clusterCrit
clust.ch          Calinski-Harabasz Pseudo F-Statistic   clusterCrit
clust.silhouette  Rousseeuw’s Silhouette Quality Index   clusterCrit
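
These are internal validation measures, so scoring a prediction also needs the task in order to access the feature data. A minimal sketch, reusing the preds and task objects from the Getting Started example:

# score a single prediction with an integrated cluster measure
measure = msr("clust.silhouette")
preds$score(measure, task = task)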

Integrated Tasks

There is only one built-in Task in the package:

mlr_tasks$get("usarrests")
<TaskClust:usarrests> (50 x 4)
* Target: -
* Properties: -
* Features (4):
  - int (2): Assault, UrbanPop
  - dbl (2): Murder, Rape

As you can see, the biggest difference between clustering tasks and the other task types in mlr3 is the absence of a Target column.
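
You can also create your own clustering task from any data.frame by passing it as a backend to TaskClust$new(); note that there is no target argument. A minimal sketch using the built-in iris data (with the Species column dropped):

# build a clustering task from a data.frame without a target column
task_iris = TaskClust$new(id = "iris", backend = iris[, 1:4])
task_iris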

Hyperparameters

Setting hyperparameters for clusterers is as easy as setting parameters for any other mlr3 learner:

task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$param_set
<ParamSet>
          id    class lower upper nlevels       default value
1:   centers ParamUty    NA    NA     Inf             2     2
2:  iter.max ParamInt     1   Inf     Inf            10      
3: algorithm ParamFct    NA    NA       4 Hartigan-Wong      
4:    nstart ParamInt     1   Inf     Inf             1      
5:     trace ParamInt     0   Inf     Inf             0      
learner$param_set$values = list(centers = 3L, algorithm = "Lloyd", iter.max = 100L)
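
Equivalently, the same hyperparameters can be set at construction time with the lrn() sugar function:

# set hyperparameters directly when constructing the learner
learner = lrn("clust.kmeans", centers = 3L, algorithm = "Lloyd", iter.max = 100L)
learner$param_set$values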

Train and Predict

The “train” method simply creates a model with cluster assignments for the data, while the “predict” method’s functionality varies depending on the clusterer in question. Read each learner’s documentation for details.

For example, the kmeans learner’s predict method uses clue::cl_predict, which performs cluster assignments for new data by looking at the “closest” neighbors of the new observations.

Following the example from the previous section:

task = mlr_tasks$get("usarrests")
train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)

learner = mlr_learners$get("clust.kmeans")
learner$train(task, row_ids = train_set)

preds = learner$predict(task, row_ids = test_set)
preds
<PredictionClust> for 10 observations:
    row_ids partition
          1         2
          4         1
          6         2
---                  
         38         1
         47         1
         48         1
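
If you want to cluster observations that are not part of any task, predict_newdata() accepts a plain data.frame with the same feature columns. A small sketch (the feature values below are made up purely for illustration):

# assign new observations to the learned clusters
newdata = data.frame(Assault = 150L, UrbanPop = 65L, Murder = 8.5, Rape = 20.1)
learner$predict_newdata(newdata)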

Benchmarking and Evaluation

To assess the quality of any machine learning experiment, you need to choose an evaluation metric that makes the most sense. Let’s design an experiment that will allow you to compare the performance of three different clusterers on the same task. The mlr3 library provides benchmarking functionality that lets you create such experiments.

# design an experiment by specifying task(s), learner(s), resampling method(s)
design = benchmark_grid(
  tasks = tsk("usarrests"),
  learners = list(
    lrn("clust.kmeans", centers = 3L),
    lrn("clust.pam", k = 3L),
    lrn("clust.cmeans", centers = 3L)),
  resamplings = rsmp("holdout"))
print(design)
              task                  learner              resampling
1: <TaskClust[45]> <LearnerClustKMeans[37]> <ResamplingHoldout[19]>
2: <TaskClust[45]>    <LearnerClustPAM[37]> <ResamplingHoldout[19]>
3: <TaskClust[45]> <LearnerClustCMeans[37]> <ResamplingHoldout[19]>
# execute benchmark
bmr = benchmark(design)
INFO  [20:59:47.643] [mlr3] Running benchmark with 3 resampling iterations 
INFO  [20:59:47.757] [mlr3] Applying learner 'clust.pam' on task 'usarrests' (iter 1/1) 
INFO  [20:59:47.950] [mlr3] Applying learner 'clust.kmeans' on task 'usarrests' (iter 1/1) 
INFO  [20:59:47.965] [mlr3] Applying learner 'clust.cmeans' on task 'usarrests' (iter 1/1) 
INFO  [20:59:48.016] [mlr3] Finished benchmark 
# define measure
measures = list(msr("clust.silhouette"))

bmr$aggregate(measures)
   nr      resample_result   task_id   learner_id resampling_id iters
1:  1 <ResampleResult[22]> usarrests clust.kmeans       holdout     1
2:  2 <ResampleResult[22]> usarrests    clust.pam       holdout     1
3:  3 <ResampleResult[22]> usarrests clust.cmeans       holdout     1
   clust.silhouette
1:        0.3638224
2:        0.5577741
3:        0.3638224
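
You can just as easily aggregate several measures at once, e.g. adding the Davies-Bouldin index next to the silhouette for a second view of the same benchmark:

# aggregate the benchmark result over multiple cluster measures
measures = list(msr("clust.silhouette"), msr("clust.db"))
bmr$aggregate(measures)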

Visualization

How do you visualize clustering tasks and results? The mlr3viz package (version >= 0.4.0) now provides that functionality.

install.packages("mlr3viz")
library(mlr3viz)

task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.kmeans")
learner$param_set$values = list(centers = 3L)
learner$train(task)
preds = learner$predict(task)

# Task visualization
autoplot(task)
# Pairs plot with cluster assignments
autoplot(preds, task)
# Silhouette plot with mean silhouette value as reference line
autoplot(preds, task, type = "sil")
# Performing PCA on task data and showing cluster assignments
autoplot(preds, task, type = "pca")

Keep in mind that mlr3viz::autoplot also provides more options depending on the kind of plot you’re interested in. For example, to draw borders around clusters, pass the appropriate parameters from ggfortify::autoplot.kmeans:

autoplot(preds, task, type = "pca", frame = TRUE)

You can also easily visualize dendrograms:

task = mlr_tasks$get("usarrests")
learner = mlr_learners$get("clust.agnes")
learner$train(task)

# Simple dendrogram
autoplot(learner)
# More advanced options from `factoextra::fviz_dend`
autoplot(learner,
  k = learner$param_set$values$k, rect_fill = TRUE,
  rect = TRUE, rect_border = c("red", "cyan"))

Further Development

If you have any issues with the package or would like to request a new feature, feel free to open an issue in the mlr3cluster GitHub repository (https://github.com/mlr-org/mlr3cluster/issues).

Acknowledgements

I would like to thank the following people for their help and guidance: Michel Lang, Lars Kotthoff, Martin Binder, Patrick Schratz, Bernd Bischl.

Citation

For attribution, please cite this work as

Pulatov (2020, Aug. 26). mlr-org: Introducing mlr3cluster: Cluster Analysis Package. Retrieved from https://mlr-org.github.io/mlr-org-website/posts/2020-08-26-introducing-mlr3cluster-cluster-analysis-package/

BibTeX citation

@misc{pulatov2020introducing,
  author = {Pulatov, Damir},
  title = {mlr-org: Introducing mlr3cluster: Cluster Analysis Package},
  url = {https://mlr-org.github.io/mlr-org-website/posts/2020-08-26-introducing-mlr3cluster-cluster-analysis-package/},
  year = {2020}
}