Integrating Another Filter Method

mlr already integrates many feature filter methods. A complete list is given in the Appendix and can also be obtained with listFilterMethods. You can easily add another filter, be it a brand new one or a method already implemented in another package, via the function makeFilter.

Filter objects

In mlr all filter methods are objects of class Filter and are registered in an environment called .FilterRegister (where listFilterMethods looks them up to compile the list of available methods). To get to know their structure let's have a closer look at the "rank.correlation" filter which interfaces function correls in package Rfast.

filters = as.list(mlr:::.FilterRegister)
filters$rank.correlation
#> Filter: 'rank.correlation'
#> Packages: 'Rfast'
#> Supported tasks: regr
#> Supported features: numerics

str(filters$rank.correlation)
#> List of 6
#>  $ name              : chr "rank.correlation"
#>  $ desc              : chr "Spearman's correlation between feature and target"
#>  $ pkg               : chr "Rfast"
#>  $ supported.tasks   : chr "regr"
#>  $ supported.features: chr "numerics"
#>  $ fun               :function (task, nselect, ...)  
#>  - attr(*, "class")= chr "Filter"

filters$rank.correlation$fun
#> function (task, nselect, ...) 
#> {
#>     d = getTaskData(task, target.extra = TRUE)
#>     y = Rfast::correls(d$target, d$data, type = "spearman")
#>     for (i in which(is.na(y[, "correlation"]))) {
#>         y[i, "correlation"] = cor(d$target, d$data[, i], use = "complete.obs", 
#>             method = "spearman")
#>     }
#>     setNames(abs(y[, "correlation"]), getTaskFeatureNames(task))
#> }
#> <bytecode: 0xd90a5f0>
#> <environment: namespace:mlr>

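The core element is $fun, which calculates the feature importance. For the "rank.correlation" filter it extracts the features and the target from the task and passes them on to the correls function. Any correlation that comes back as NA is recomputed with cor, and the absolute correlations are returned as a vector named by feature.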
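As a rough illustration of what this scoring amounts to, the following sketch computes the same kind of score in base R, without Rfast or mlr. The helper name abs_spearman_scores and the use of the mtcars data are assumptions made for this example; this is not the code mlr runs.

```r
# Hypothetical stand-alone helper (not part of mlr): absolute Spearman
# correlation between each numeric feature and a numeric target.
abs_spearman_scores = function(data, target) {
  vapply(data, function(x) {
    abs(cor(target, x, use = "complete.obs", method = "spearman"))
  }, numeric(1))
}

# Illustrative data: score four numeric columns of mtcars against mpg
feats = mtcars[, c("cyl", "disp", "hp", "wt")]
scores = abs_spearman_scores(feats, mtcars$mpg)

# Named vector with one score per feature; larger means more important
sort(scores, decreasing = TRUE)
```

The result has the same shape as the return value of a filter's $fun: a numeric vector whose names are the feature names.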

Additionally, each Filter object has a $name, which should be short and is used, for example, to annotate graphics (cf. plotFilterValues), and a slightly more detailed description in slot $desc. If the filter method is implemented by another package, that package's name is given in the $pkg member. Moreover, the supported task types and feature types are listed.

Writing a new filter method

You can integrate your own filter method using makeFilter. This function generates a Filter object and also registers it in the .FilterRegister environment.

The arguments of makeFilter correspond to the slot names of the Filter object above. Currently, feature filtering is only supported for supervised learning tasks and possible values for supported.tasks are "regr", "classif" and "surv". supported.features can be "numerics", "factors" and "ordered".

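fun must be a function with at least the following formal arguments:

task, an mlr learning Task;
nselect, the number of features for which scores should be calculated (it corresponds to the nselect argument of generateFilterValuesData);
..., for additional arguments.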

fun must return a named vector of feature importance values. By convention the most important features receive the highest scores.

If you are making use of the nselect option, fun can return either a vector of nselect scores or a vector as long as the total number of features in the task, filled with NAs for all features whose scores weren't calculated.
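The second return shape can be sketched in base R as follows. The helper keep_top_scores and the example scores are hypothetical. A real filter would avoid computing the skipped scores in the first place, whereas this sketch only shows what the returned vector looks like.

```r
# Hypothetical helper: keep the nselect highest scores and set the rest to NA,
# so the result stays as long as the number of features.
keep_top_scores = function(scores, nselect) {
  out = setNames(rep(NA_real_, length(scores)), names(scores))
  top = head(order(scores, decreasing = TRUE), nselect)
  out[top] = scores[top]
  out
}

# Made-up scores for four features named a to d
scores = c(a = 0.2, b = 0.9, c = 0.5, d = 0.1)
res = keep_top_scores(scores, nselect = 2)
res
```

Here only b and c keep their scores, while a and d are NA.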

When writing fun many of the getter functions for Tasks come in handy, particularly getTaskData, getTaskFormula and getTaskFeatureNames. It's worth having a closer look at getTaskData which provides many options for formatting the data and recoding the target variable.
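A short sketch of these getters, assuming mlr is installed and using the example tasks iris.task and sonar.task that ship with it. The recode.target value "01" is intended for binary targets, which is why sonar.task is used for that call. The requireNamespace guard is only there so the block does nothing when mlr is missing.

```r
if (requireNamespace("mlr", quietly = TRUE)) {
  library(mlr)

  # Features and target returned separately as a list with $data and $target
  d = getTaskData(iris.task, target.extra = TRUE)
  print(names(d$data))
  print(levels(d$target))

  # Binary target recoded to 0/1 instead of a factor
  d01 = getTaskData(sonar.task, target.extra = TRUE, recode.target = "01")
  print(table(d01$target))

  # Formula with the target on the left-hand side
  print(getTaskFormula(iris.task))

  # Feature names in the order used by the task
  print(getTaskFeatureNames(iris.task))
}
```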

As a short demonstration we write a totally meaningless filter that determines the importance of features according to alphabetical order, i.e., giving highest scores to features with names that come first (decreasing = TRUE) or last (decreasing = FALSE) in the alphabet.

makeFilter(
  name = "nonsense.filter",
  desc = "Calculates scores according to alphabetical order of features",
  pkg = "",
  supported.tasks = c("classif", "regr", "surv"),
  supported.features = c("numerics", "factors", "ordered"),
  fun = function(task, nselect, decreasing = TRUE, ...) {
    feats = getTaskFeatureNames(task)
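    # rank() gives each name its alphabetical position;
    # order() would return a sorting permutation instead
    imp = rank(feats)
    if (decreasing) imp = length(feats) + 1 - imp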
    names(imp) = feats
    imp
  }
)
#> Filter: 'nonsense.filter'
#> Packages: ''
#> Supported tasks: classif,regr,surv
#> Supported features: numerics,factors,ordered

The nonsense.filter is now registered in mlr and shown by listFilterMethods.

listFilterMethods()$id
#>  [1] anova.test                 carscore                  
#>  [3] cforest.importance         chi.squared               
#>  [5] gain.ratio                 information.gain          
#>  [7] kruskal.test               linear.correlation        
#>  [9] mrmr                       nonsense.filter           
#> [11] oneR                       permutation.importance    
#> [13] randomForest.importance    randomForestSRC.rfsrc     
#> [15] randomForestSRC.var.select rank.correlation          
#> [17] relief                     symmetrical.uncertainty   
#> [19] univariate.model.score     variance                  
#> 23 Levels: anova.test carscore cforest.importance ... variance

You can use it like any other filter method already integrated in mlr (i.e., via the method argument of generateFilterValuesData or the fw.method argument of makeFilterWrapper; see also the page on feature selection).
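For the wrapper route, a minimal sketch might look like the following. It assumes mlr and rpart are installed and that "nonsense.filter" has been registered by the makeFilter call above. If the filter is not registered, the sketch falls back to the built-in "variance" filter so that the block still runs.

```r
if (requireNamespace("mlr", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  library(mlr)

  # Use the custom filter if it is registered, otherwise a built-in one
  method = if ("nonsense.filter" %in% listFilterMethods()$id) {
    "nonsense.filter"
  } else {
    "variance"
  }

  # Fuse a learner with the filter; keep the 2 top-scoring features
  lrn = makeFilterWrapper("classif.rpart", fw.method = method, fw.abs = 2)
  mod = train(lrn, iris.task)

  # Features that survived the filtering step
  print(getFilteredFeatures(mod))
}
```

Fusing the filter with the learner means the filtering is repeated on each training set, for example inside resampling, rather than applied once to the whole task.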

d = generateFilterValuesData(iris.task, method = c("nonsense.filter", "anova.test"))
d
#> FilterValues:
#> Task: iris-example
#>           name    type nonsense.filter anova.test
#> 1 Sepal.Length numeric               2  119.26450
#> 2  Sepal.Width numeric               1   49.16004
#> 3 Petal.Length numeric               4 1180.16118
#> 4  Petal.Width numeric               3  960.00715

plotFilterValues(d)

iris.task.filtered = filterFeatures(iris.task, method = "nonsense.filter", abs = 2)
iris.task.filtered
#> Supervised task: iris-example
#> Type: classif
#> Target: Species
#> Observations: 150
#> Features:
#> numerics  factors  ordered 
#>        2        0        0 
#> Missings: FALSE
#> Has weights: FALSE
#> Has blocking: FALSE
#> Classes: 3
#>     setosa versicolor  virginica 
#>         50         50         50 
#> Positive class: NA

getTaskFeatureNames(iris.task.filtered)
#> [1] "Petal.Length" "Petal.Width"

You might also want to have a look at the source code of the filter methods already integrated in mlr for some more complex and meaningful examples.