Implementation of "TabNet" from the paper TabNet: Attentive Interpretable Tabular Learning (Sercan, Pfister, 2019). See https://arxiv.org/abs/1908.07442 for details.
An R6::R6Class() inheriting from LearnerClassifKeras.
The learner can be constructed via LearnerClassifTabNet$new(), mlr3::mlr_learners$get("classif.tabnet"), or mlr3::lrn("classif.tabnet").
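A minimal usage sketch, assuming the mlr3keras package (which provides this learner) plus working keras and tensorflow installations, and using the built-in iris task purely for illustration:

library(mlr3)
library(mlr3keras)  # assumed to provide LearnerClassifTabNet

# construct the learner with a couple of training settings
learner = lrn("classif.tabnet", epochs = 30L, batch_size = 128L)

# train and predict on a toy task
task = tsk("iris")
learner$train(task)
prediction = learner$predict(task)
prediction$score(msr("classif.acc"))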
Additional Arguments:
embed_size
: Size of the embedding for categorical, character, and ordered factors.
Defaults to min(600L, round(1.6 * length(levels)^0.56)).
stacked
: Should a StackedTabNetModel be used instead of a normal TabNetModel?
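A small sketch of the documented embed_size default for a single factor feature (default_embed_size() below is purely illustrative and not part of the learner's API):

# illustrative helper reproducing the documented default
default_embed_size = function(n_levels) {
  min(600L, round(1.6 * n_levels^0.56))
}
default_embed_size(4L)     # low-cardinality factor  -> 3
default_embed_size(1000L)  # high-cardinality factor -> 77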
The TabNet authors consider datasets ranging from 10K to 10M training points, with varying degrees of fitting difficulty, and report that TabNet obtains high performance on all of them when a few general principles of hyperparameter selection are followed:

Most datasets yield the best results for N_steps between 3 and 10. Typically, larger datasets and more complex tasks require a larger N_steps. A very high value of N_steps may lead to overfitting and poor generalization.

Adjusting the values of N_d and N_a is the most efficient way to trade off performance against complexity. N_d = N_a is a reasonable choice for most datasets. Very high values of N_d and N_a may lead to overfitting and poor generalization.

An optimal choice of the relaxation factor gamma can have a major effect on overall performance. Typically, a larger N_steps value favors a larger gamma.

A large batch size is beneficial for performance: if memory constraints permit, a batch size as large as 1-10% of the total training dataset size is suggested. The virtual batch size is typically much smaller than the batch size.

An initially large learning rate is important; it should be gradually decayed until convergence (see the configuration sketch below).
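As a rough illustration of these recommendations, a configuration along the following lines could be used. All concrete values are illustrative rather than tuned; optimizer_adam() and callback_reduce_lr_on_plateau() come from the keras R package, and the learning-rate argument name differs between keras versions:

library(keras)

learner = lrn("classif.tabnet",
  num_decision_steps = 5L,    # N_steps, typically between 3 and 10
  feature_dim        = 32L,   # N_a
  output_dim         = 32L,   # N_d = N_a is a reasonable default
  relaxation_factor  = 1.5,   # gamma; a larger N_steps favors a larger gamma
  batch_size         = 1024L, # large batches tend to help
  epochs             = 100L,
  # start with a comparatively large learning rate and decay it over training
  optimizer = optimizer_adam(learning_rate = 0.02),
  callbacks = list(
    callback_reduce_lr_on_plateau(monitor = "loss", factor = 0.5, patience = 5)
  )
)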
The R class wraps the Python implementation available at https://github.com/titu1994/tf-TabNet/tree/master/tabnet.
feature_dim
(N_a): Dimensionality of the hidden representation in the feature
transformation block. Each layer first maps the representation to a
2*feature_dim-dimensional output; half of it is used to determine the
nonlinearity of the GLU activation, while the other half is used as
input to the GLU, and eventually a feature_dim-dimensional output is
passed on to the next layer.
output_dim
(N_d): Dimensionality of the outputs of each decision step, which is
later mapped to the final classification or regression output.
num_features
: The number of input features (i.e. the number of columns for
tabular data, assuming each feature is represented by one dimension).
num_decision_steps
(N_steps): Number of sequential decision steps.
relaxation_factor
(gamma): Relaxation factor that promotes the reuse of each
feature at different decision steps. When it is 1, a feature is enforced
to be used at only one decision step; as it increases, more
flexibility is provided to use a feature at multiple decision steps.
sparsity_coefficient
(lambda_sparse): Strength of the sparsity regularization.
Sparsity may provide a favorable inductive bias for convergence to
higher accuracy for some datasets where most of the input features are redundant.
norm_type
: Type of normalization to perform for the model. Can be either
'batch' or 'group'. 'group' is the default.
batch_momentum
: Momentum in ghost batch normalization.
virtual_batch_size
: Virtual batch size in ghost batch normalization. The
overall batch size should be an integer multiple of virtual_batch_size.
num_groups
: Number of groups used for group normalization. The number of groups
should be a divisor of the number of input features (num_features).
epsilon
: A small number for numerical stability of the entropy calculations.
num_layers
: Required for the stacked TabNet (stacked = TRUE). Automatically set to 1L
if not provided.
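To illustrate the two normalization-related constraints above with arbitrary example values: batch_size should be an integer multiple of virtual_batch_size, and num_groups should divide the number of input features:

# ghost batch normalization: 1024 is an integer multiple of 128
learner_bn = lrn("classif.tabnet",
  norm_type = "batch", batch_size = 1024L, virtual_batch_size = 128L)

# group normalization: e.g. 2 groups for a task with 4 input features
learner_gn = lrn("classif.tabnet", norm_type = "group", num_groups = 2L)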
Keras Learners offer several methods for easy access to the stored models.
.$plot()
Plots the history, i.e. the train-validation loss during training.
.$save(file_path)
Dumps the model to a provided file_path in 'h5' format.
.$load_model_from_file(file_path)
Loads a model previously saved via .$save() back into the learner.
The model needs to be saved separately when the learner is serialized;
in that case, the learner can be restored using this function.
Currently not implemented for 'TabNet'.
.$lr_find(task, epochs, lr_min, lr_max, batch_size)
Employs an implementation of the learning rate finder as popularized by
Jeremy Howard in fast.ai (http://course.fast.ai/) for the learner.
For more information on the parameters, see find_lr.
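A sketch of how these helpers might be combined after construction (file path and learning-rate range are illustrative; note that loading a saved model back is currently not implemented for 'TabNet'):

# explore learning rates before the final fit (illustrative range)
learner$lr_find(task, epochs = 5L, lr_min = 1e-4, lr_max = 1e-1, batch_size = 128L)

# hold out part of the data so the history contains a validation loss
learner$param_set$values$validation_split = 0.2
learner$train(task)
learner$plot()                   # train/validation loss over epochs
learner$save("tabnet_model.h5")  # store the fitted keras model separately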
Arik, S. O. and Pfister, T. (2019): TabNet: Attentive Interpretable Tabular Learning. https://arxiv.org/abs/1908.07442.
#> <LearnerClassifTabNet:classif.tabnet>
#> * Model: -
#> * Parameters: epochs=100, validation_split=0, batch_size=128,
#>   callbacks=<list>, low_memory=FALSE, verbose=0, embed_size=<NULL>,
#>   stacked=FALSE, batch_momentum=0.98, relaxation_factor=1,
#>   sparsity_coefficient=1e-05, num_decision_steps=2, feature_dim=64,
#>   output_dim=64, num_groups=1, epsilon=1e-05, norm_type=group,
#>   virtual_batch_size=<NULL>, loss=categorical_crossentropy,
#>   optimizer=<keras.optimizer_v2.adam.Adam>, metrics=accuracy
#> * Packages: keras, tensorflow, reticulate
#> * Predict Type: response
#> * Feature types: integer, numeric, factor, logical, ordered
#> * Properties: multiclass, twoclass

# available parameters:
learner$param_set$ids()
#>  [1] "epochs"               "model"                "class_weight"
#>  [4] "validation_split"     "batch_size"           "callbacks"
#>  [7] "low_memory"           "verbose"              "embed_size"
#> [10] "stacked"              "num_layers"           "batch_momentum"
#> [13] "relaxation_factor"    "sparsity_coefficient" "num_decision_steps"
#> [16] "feature_dim"          "output_dim"           "num_groups"
#> [19] "epsilon"              "norm_type"            "virtual_batch_size"
#> [22] "loss"                 "optimizer"            "metrics"