Implementation of "TabNet" from the paper TabNet: Attentive Interpretable Tabular Learning (Sercan, Pfister, 2019). See https://arxiv.org/abs/1908.07442 for details.
An R6::R6Class() inheriting from LearnerClassifKeras.
The learner can be constructed via LearnerClassifTabNet$new(), mlr3::mlr_learners$get("classif.tabnet"), or mlr3::lrn("classif.tabnet").
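A minimal usage sketch, assuming the mlr3keras package (which provides this learner) plus working keras and tensorflow installations, and using the built-in iris task purely for illustration:

library(mlr3)
library(mlr3keras)  # assumed to provide LearnerClassifTabNet

# construct the learner with a couple of training settings
learner = lrn("classif.tabnet", epochs = 30L, batch_size = 128L)

# train and predict on a toy task
task = tsk("iris")
learner$train(task)
prediction = learner$predict(task)
prediction$score(msr("classif.acc"))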
Additional Arguments:
embed_size
: Size of the embedding for categorical, character, and ordered factors.
Defaults to min(600L, round(1.6 * length(levels)^0.56)).
stacked
: Should a StackedTabNetModel be used instead of a normal TabNetModel?
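A small sketch of the documented embed_size default for a single factor feature (default_embed_size() below is purely illustrative and not part of the learner's API):

# illustrative helper reproducing the documented default
default_embed_size = function(n_levels) {
  min(600L, round(1.6 * n_levels^0.56))
}
default_embed_size(4L)     # low-cardinality factor  -> 3
default_embed_size(1000L)  # high-cardinality factor -> 77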
The TabNet authors consider datasets ranging from 10K to 10M training points, with varying degrees of fitting difficulty, and report that TabNet obtains high performance on all of them when a few general principles of hyperparameter selection are followed:

Most datasets yield the best results for N_steps between 3 and 10. Typically, larger datasets and more complex tasks require a larger N_steps. A very high value of N_steps may lead to overfitting and poor generalization.

Adjusting the values of N_d and N_a is the most efficient way to trade off performance against complexity. N_d = N_a is a reasonable choice for most datasets. Very high values of N_d and N_a may lead to overfitting and poor generalization.

An optimal choice of the relaxation factor gamma can have a major effect on overall performance. Typically, a larger N_steps value favors a larger gamma.

A large batch size is beneficial for performance: if memory constraints permit, a batch size as large as 1-10% of the total training dataset size is suggested. The virtual batch size is typically much smaller than the batch size.

An initially large learning rate is important; it should be gradually decayed until convergence (see the configuration sketch below).
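As a rough illustration of these recommendations, a configuration along the following lines could be used. All concrete values are illustrative rather than tuned; optimizer_adam() and callback_reduce_lr_on_plateau() come from the keras R package, and the learning-rate argument name differs between keras versions:

library(keras)

learner = lrn("classif.tabnet",
  num_decision_steps = 5L,    # N_steps, typically between 3 and 10
  feature_dim        = 32L,   # N_a
  output_dim         = 32L,   # N_d = N_a is a reasonable default
  relaxation_factor  = 1.5,   # gamma; a larger N_steps favors a larger gamma
  batch_size         = 1024L, # large batches tend to help
  epochs             = 100L,
  # start with a comparatively large learning rate and decay it over training
  optimizer = optimizer_adam(learning_rate = 0.02),
  callbacks = list(
    callback_reduce_lr_on_plateau(monitor = "loss", factor = 0.5, patience = 5)
  )
)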
The R class wraps the Python implementation available at https://github.com/titu1994/tf-TabNet/tree/master/tabnet.
feature_dim
(N_a): Dimensionality of the hidden representation in the feature
transformation block. Each layer first maps the representation to a
2*feature_dim-dimensional output; half of it is used to determine the
nonlinearity of the GLU activation, while the other half is used as
input to the GLU, and eventually a feature_dim-dimensional output is
passed on to the next layer.
output_dim
(N_d): Dimensionality of the outputs of each decision step, which is
later mapped to the final classification or regression output.
num_features
: The number of input features (i.e. the number of columns for
tabular data, assuming each feature is represented by one dimension).
num_decision_steps
(N_steps): Number of sequential decision steps.
relaxation_factor
(gamma): Relaxation factor that promotes the reuse of each
feature at different decision steps. When it is 1, a feature is enforced
to be used at only one decision step; as it increases, more
flexibility is provided to use a feature at multiple decision steps.
sparsity_coefficient
(lambda_sparse): Strength of the sparsity regularization.
Sparsity may provide a favorable inductive bias for convergence to
higher accuracy for some datasets where most of the input features are redundant.
norm_type
: Type of normalization to perform for the model. Can be either
'batch' or 'group'. 'group' is the default.
batch_momentum
: Momentum in ghost batch normalization.
virtual_batch_size
: Virtual batch size in ghost batch normalization. The
overall batch size should be an integer multiple of virtual_batch_size.
num_groups
: Number of groups used for group normalization. The number of groups
should be a divisor of the number of input features (num_features).
epsilon
: A small number for numerical stability of the entropy calculations.
num_layers
: Required for the stacked TabNet (stacked = TRUE). Automatically set to 1L
if not provided.
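To illustrate the two normalization-related constraints above with arbitrary example values: batch_size should be an integer multiple of virtual_batch_size, and num_groups should divide the number of input features:

# ghost batch normalization: 1024 is an integer multiple of 128
learner_bn = lrn("classif.tabnet",
  norm_type = "batch", batch_size = 1024L, virtual_batch_size = 128L)

# group normalization: e.g. 2 groups for a task with 4 input features
learner_gn = lrn("classif.tabnet", norm_type = "group", num_groups = 2L)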
Keras Learners offer several methods for easy access to the stored models.
.$plot()
Plots the history, i.e. the train-validation loss during training.
.$save(file_path)
Dumps the model to a provided file_path in 'h5' format.
.$load_model_from_file(file_path)
Loads a model previously saved via .$save() back into the learner.
The model needs to be saved separately when the learner is serialized;
in that case, the learner can be restored using this function.
Currently not implemented for 'TabNet'.
.$lr_find(task, epochs, lr_min, lr_max, batch_size)
Employs an implementation of the learning rate finder as popularized by
Jeremy Howard in fast.ai (http://course.fast.ai/) for the learner.
For more information on the parameters, see find_lr.
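A sketch of how these helpers might be combined after construction (file path and learning-rate range are illustrative; note that loading a saved model back is currently not implemented for 'TabNet'):

# explore learning rates before the final fit (illustrative range)
learner$lr_find(task, epochs = 5L, lr_min = 1e-4, lr_max = 1e-1, batch_size = 128L)

# hold out part of the data so the history contains a validation loss
learner$param_set$values$validation_split = 0.2
learner$train(task)
learner$plot()                   # train/validation loss over epochs
learner$save("tabnet_model.h5")  # store the fitted keras model separately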
Arik, S. O. and Pfister, T. (2019): TabNet: Attentive Interpretable Tabular Learning. https://arxiv.org/abs/1908.07442.
#> <LearnerClassifTabNet:classif.tabnet>
#> * Model: -
#> * Parameters: epochs=100, validation_split=0, batch_size=128,
#>   callbacks=<list>, low_memory=FALSE, verbose=0, embed_size=<NULL>,
#>   stacked=FALSE, batch_momentum=0.98, relaxation_factor=1,
#>   sparsity_coefficient=1e-05, num_decision_steps=2, feature_dim=64,
#>   output_dim=64, num_groups=1, epsilon=1e-05, norm_type=group,
#>   virtual_batch_size=<NULL>, loss=categorical_crossentropy,
#>   optimizer=<keras.optimizer_v2.adam.Adam>, metrics=accuracy
#> * Packages: keras, tensorflow, reticulate
#> * Predict Type: response
#> * Feature types: integer, numeric, factor, logical, ordered
#> * Properties: multiclass, twoclass

# available parameters:
learner$param_set$ids()
#>  [1] "epochs"               "model"                "class_weight"
#>  [4] "validation_split"     "batch_size"           "callbacks"
#>  [7] "low_memory"           "verbose"              "embed_size"
#> [10] "stacked"              "num_layers"           "batch_momentum"
#> [13] "relaxation_factor"    "sparsity_coefficient" "num_decision_steps"
#> [16] "feature_dim"          "output_dim"           "num_groups"
#> [19] "epsilon"              "norm_type"            "virtual_batch_size"
#> [22] "loss"                 "optimizer"            "metrics"