Learning tasks encapsulate the data set and further relevant information about a machine learning problem, for example the name of the target variable for supervised problems.

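The tasks are organized in a hierarchy, with the generic Task at the top. The following tasks can be instantiated, and all inherit from the virtual superclass Task:

- RegrTask for regression problems,
- ClassifTask for binary and multi-class classification problems,
- SurvTask for survival analysis,
- ClusterTask for cluster analysis,
- MultilabelTask for multilabel classification problems,
- CostSensTask for general cost-sensitive classification.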

To create a task, call make<TaskType>, e.g., makeClassifTask. Every task takes an identifier (argument id) and a data.frame (argument data). If no ID is provided, it is generated automatically from the variable name of the data. The ID is later used to name results, for example of benchmark experiments, and to annotate plots. Depending on the nature of the learning problem, additional arguments may be required; these are discussed in the following sections.
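The automatic ID can be checked with a small sketch. It assumes that mlr is installed and uses R's built-in iris data, which is not part of the examples on this page.

```r
library(mlr)

# No id is passed, so the ID is derived from the name of the data argument.
auto.task = makeClassifTask(data = iris, target = "Species")
getTaskId(auto.task)

# An explicit id overrides the automatically generated one.
named.task = makeClassifTask(id = "flowers", data = iris, target = "Species")
getTaskId(named.task)
```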

### Regression

For supervised learning such as regression (as well as classification and survival analysis), we have to specify the name of the target variable in addition to data.

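data(BostonHousing, package = "mlbench")
## No id is given, so it is generated from the variable name of the data
regr.task = makeRegrTask(data = BostonHousing, target = "medv")
regr.task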
#> Type: regr
#> Target: medv
#> Observations: 506
#> Features:
#> numerics  factors  ordered
#>       12        1        0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE


As you can see, the Task records the type of the learning problem and basic information about the data set, e.g., the types of the features (numeric vectors, factors or ordered factors), the number of observations, or whether missing values are present.

Creating tasks for classification and survival analysis follows the same scheme; only the data type of the target variables included in data differs. Specifics of these learning problems are described below.

### Classification

For classification the target column has to be a factor.
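As a hedged illustration of this requirement, the sketch below builds a small made-up data.frame (the names toy, x1, x2 and label are hypothetical) whose target is stored as character, converts it to a factor, and then creates the task.

```r
library(mlr)

# Hypothetical toy data; the target column is a character vector at first.
toy = data.frame(
  x1 = c(1.2, 0.7, 3.1, 2.4),
  x2 = c(0.3, 1.9, 0.8, 2.2),
  label = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Convert the target column to a factor before creating the task.
toy$label = factor(toy$label)
toy.task = makeClassifTask(id = "toy", data = toy, target = "label")
getTaskClassLevels(toy.task)
```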

In the following example we define a classification task for the BreastCancer data set and exclude the variable Id from all further model fitting and evaluation.

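data(BreastCancer, package = "mlbench")
df = BreastCancer
## Exclude the variable Id
df$Id = NULL
classif.task = makeClassifTask(id = "BreastCancer", data = df, target = "Class", positive = "malignant")
## Show the task description
getTaskDesc(classif.task)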
#> $id
#> [1] "BreastCancer"
#>
#> $type
#> [1] "classif"
#>
#> $target
#> [1] "Class"
#>
#> $size
#> [1] 699
#>
#> $n.feat
#> numerics  factors  ordered
#>        0        4        5
#>
#> $has.missings
#> [1] TRUE
#>
#> $has.weights
#> [1] FALSE
#>
#> $has.blocking
#> [1] FALSE
#>
#> $class.levels
#> [1] "benign"    "malignant"
#>
#> $positive
#> [1] "malignant"
#>
#> $negative
#> [1] "benign"
#>
#> attr(,"class")


Note that task descriptions have slightly different elements for different types of Tasks. Frequently required elements can also be accessed directly.

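## Get the ID
getTaskId(classif.task)
#> [1] "BreastCancer"

## Get the type of task
getTaskType(classif.task)
#> [1] "classif"

## Get the names of the target columns
getTaskTargetNames(classif.task)
#> [1] "Class"

## Get the number of observations
getTaskSize(classif.task)
#> [1] 699

## Get the number of input variables
getTaskNFeats(classif.task)
#> [1] 9

## Get the class levels in classif.task
getTaskClassLevels(classif.task)
#> [1] "benign"    "malignant"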


Moreover, mlr provides several functions to extract data from a Task.
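The examples that follow refer to cluster.task and surv.task, which are not created on this page. A minimal sketch of how such tasks might be set up, assuming the built-in mtcars data and the lung data from package survival (both choices are inferred from the printed output, not stated in the text):

```r
library(mlr)

# Cluster task: unsupervised, so no target is specified (assumed data: mtcars).
cluster.task = makeClusterTask(data = mtcars)

# Survival task: the target consists of a time column and an event indicator
# (assumed data: survival::lung, where status == 2 marks an event).
lung = survival::lung
lung$status = (lung$status == 2)
surv.task = makeSurvTask(data = lung, target = c("time", "status"))

getTaskType(cluster.task)
getTaskTargetNames(surv.task)
```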

## Accessing the data set in classif.task
str(getTaskData(classif.task))
#> 'data.frame':    699 obs. of  10 variables:
#>  $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
#>  $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
#>  $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
#>  $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
#>  $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
#>  $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
#>  $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
#>  $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
#>  $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
#>  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

## Get the names of the input variables in cluster.task
getTaskFeatureNames(cluster.task)
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"

## Get the values of the target variables in surv.task
head(getTaskTargets(surv.task))
#>   time status
#> 1  306   TRUE
#> 2  455   TRUE
#> 3 1010  FALSE
#> 4  210   TRUE
#> 5  883   TRUE
#> 6 1022  FALSE

## Get the cost matrix in costsens.task
head(getTaskCosts(costsens.task))
#> NULL


Note that getTaskData offers many options for converting the data set into a convenient format. This is especially handy when you integrate a new learner from another R package into mlr. The function getTaskFormula is also useful in this regard.
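A few of these options are sketched below on a task built from the iris data (an assumption made for brevity; any supervised task should behave similarly). The arguments subset and target.extra belong to getTaskData; see its help page for the remaining options.

```r
library(mlr)

fmt.task = makeClassifTask(data = iris, target = "Species")

# Return only the first five observations.
first.rows = getTaskData(fmt.task, subset = 1:5)

# Split features and target, as many modelling functions expect.
xy = getTaskData(fmt.task, target.extra = TRUE)
str(xy$data)
head(xy$target)

# Formula interface for learners that take a formula and a data.frame.
f = getTaskFormula(fmt.task)
f
```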

mlr provides several functions to alter an existing Task, which is often more convenient than creating a new Task from scratch. Here are some examples.

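## Select observations and/or features
## (the row range 4:17 is an assumed example)
cluster.task = subsetTask(cluster.task, subset = 4:17)

## It may happen, especially after selecting observations, that features are constant.
## These should be removed.
removeConstantFeatures(cluster.task)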
#> Removing 1 columns: am
#> Type: cluster
#> Observations: 14
#> Features:
#> numerics  factors  ordered
#>       10        0        0
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE

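## Remove selected features
## (the two feature names are an assumed example)
dropFeatures(surv.task, c("meal.cal", "wt.loss"))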
#> Type: surv
#> Target: time,status
#> Events: 165
#> Observations: 228
#> Features:
#> numerics  factors  ordered
#>        6        0        0
#> Missings: TRUE
#> Has weights: FALSE
#> Has blocking: FALSE

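## Standardize numerical features
## (method = "range" rescales each feature to [0, 1])
task = normalizeFeatures(cluster.task, method = "range")
summary(getTaskData(task))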
#>       mpg              cyl              disp              hp
#>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
#>  1st Qu.:0.3161   1st Qu.:0.5000   1st Qu.:0.1242   1st Qu.:0.2801
#>  Median :0.5107   Median :1.0000   Median :0.4076   Median :0.6311
#>  Mean   :0.4872   Mean   :0.7143   Mean   :0.4430   Mean   :0.5308
#>  3rd Qu.:0.6196   3rd Qu.:1.0000   3rd Qu.:0.6618   3rd Qu.:0.7473
#>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
#>       drat              wt              qsec              vs
#>  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000
#>  1st Qu.:0.2672   1st Qu.:0.1275   1st Qu.:0.2302   1st Qu.:0.0000
#>  Median :0.3060   Median :0.1605   Median :0.3045   Median :0.0000
#>  Mean   :0.4544   Mean   :0.3268   Mean   :0.3752   Mean   :0.4286
#>  3rd Qu.:0.7026   3rd Qu.:0.3727   3rd Qu.:0.4908   3rd Qu.:1.0000
#>  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
#>        am           gear             carb
#>  Min.   :0.5   Min.   :0.0000   Min.   :0.0000
#>  1st Qu.:0.5   1st Qu.:0.0000   1st Qu.:0.3333
#>  Median :0.5   Median :0.0000   Median :0.6667
#>  Mean   :0.5   Mean   :0.2857   Mean   :0.6429
#>  3rd Qu.:0.5   3rd Qu.:0.7500   3rd Qu.:1.0000
#>  Max.   :0.5   Max.   :1.0000   Max.   :1.0000


For more functions and more detailed explanations have a look at the data preprocessing page.

## Example tasks and convenience functions

For your convenience mlr provides pre-defined Tasks for each type of learning problem. These are also used throughout this tutorial in order to get shorter and more readable code. A list of all Tasks can be found in the Appendix.
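For instance, iris.task and bh.task are two of the Tasks shipped with mlr. The sketch below only inspects them; it assumes that the package is attached.

```r
library(mlr)

# Pre-defined classification task based on the iris data.
getTaskType(iris.task)
getTaskTargetNames(iris.task)

# Pre-defined regression task based on the BostonHousing data.
getTaskType(bh.task)
getTaskTargetNames(bh.task)
```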

Moreover, mlr's function convertMLBenchObjToTask can generate Tasks from the data sets and data generating functions in package mlbench.
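A short sketch of both uses, assuming package mlbench is installed. The choice of the Ionosphere data set and the mlbench.spirals generator is only illustrative; further arguments such as sd are passed on to the generating function.

```r
library(mlr)

# Convert a data set from mlbench, referenced by name.
iono.task = convertMLBenchObjToTask("Ionosphere")
getTaskType(iono.task)

# Convert a data generating function; n sets the number of observations.
spiral.task = convertMLBenchObjToTask("mlbench.spirals", n = 100, sd = 0.1)
getTaskSize(spiral.task)
```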