Implements conditional sampling using k-nearest neighbors (kNN).
For each observation, finds the k most similar observations based on conditioning
features, then samples the target features from these neighbors.
Details
This sampler approximates the conditional distribution \(P(X_B | X_A = x_A)\) by:
Finding the k nearest neighbors of \(x_A\) in the training data
Sampling uniformly from the target feature values \(X_B\) of these k neighbors
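The two steps above can be sketched in base R for numeric conditioning features. This is a minimal illustration under stated assumptions, not the package's implementation: the helper name `knn_conditional_sample()` and its arguments are hypothetical, and standardized Euclidean distance is assumed.

```r
# Hypothetical helper: draw X_B for one query point x_a from its k nearest
# training neighbors (numeric conditioning features, Euclidean distance).
knn_conditional_sample <- function(train, x_a, cond, target, k = 5L) {
  # Standardize conditioning features using training means and SDs
  mu <- vapply(train[cond], mean, numeric(1))
  sigma <- vapply(train[cond], sd, numeric(1))
  z_train <- sweep(sweep(as.matrix(train[cond]), 2, mu, "-"), 2, sigma, "/")
  z_query <- (unlist(x_a[cond]) - mu) / sigma

  # Step 1: indices of the k nearest neighbors of x_a
  d <- sqrt(rowSums(sweep(z_train, 2, z_query, "-")^2))
  neighbors <- order(d)[seq_len(k)]

  # Step 2: pick one neighbor uniformly and return its target values
  pick <- neighbors[sample.int(k, 1L)]
  train[pick, target, drop = FALSE]
}

# Toy data in which b depends strongly on a
set.seed(1)
train <- data.frame(a = rnorm(200))
train$b <- 2 * train$a + rnorm(200, sd = 0.1)

# One draw of b given a = 1; the value comes from an observed training row
draw <- knn_conditional_sample(train, x_a = list(a = 1), cond = "a",
                               target = "b", k = 5L)
```

Because the returned values are copied from an observed row, the draw always lies on the empirical support of the training data.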
This is a simple, non-parametric approach that:
Requires no distributional assumptions
Handles mixed feature types (numeric, integer, factor, ordered, logical)
Is computationally efficient (no model fitting required)
Adapts locally to the data structure
The method is related to hot-deck imputation and kNN imputation techniques used in missing data problems. As the sample size \(n \to \infty\) with \(k \to \infty\) and \(k/n \to 0\), the kNN conditional distribution converges to the true conditional distribution under mild regularity conditions (e.g., Lipschitz continuity).
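In symbols, one way to write the empirical distribution that the sampler draws from is given below. The notation \(N_k(x_A)\) for the index set of the k nearest training observations and \(x_B^{(i)}\) for the target values of observation \(i\) is introduced here for illustration and does not appear in the package.

```latex
\hat{P}_k(X_B \in S \mid X_A = x_A)
  = \frac{1}{k} \sum_{i \in N_k(x_A)} \mathbf{1}\{ x_B^{(i)} \in S \}
```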
Distance Metrics:
The sampler supports two distance metrics:
Euclidean: For numeric/integer features only. Standardizes features before computing distances.
Gower: For mixed feature types. Handles numeric, factor, ordered, and logical features. Numeric features are range-normalized, categorical features use exact matching (0/1).
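The Gower computation described above can be sketched in base R as follows. This is an illustrative sketch only: `gower_to_rows()` is a hypothetical helper, not a function of the package, and it treats ordered factors by exact matching as a simplification.

```r
# Hypothetical helper: Gower distance from one query row to every row of `data`.
gower_to_rows <- function(data, query) {
  per_feature <- vapply(names(data), function(f) {
    x <- data[[f]]
    if (is.numeric(x)) {
      # Numeric: absolute difference scaled by the feature's observed range
      rng <- diff(range(x))
      if (rng == 0) rep(0, length(x)) else abs(x - query[[f]]) / rng
    } else {
      # Factor, ordered, or logical: 0 if the values match, 1 otherwise
      as.numeric(as.character(x) != as.character(query[[f]]))
    }
  }, numeric(nrow(data)))
  # Average the per-feature dissimilarities into one distance per row
  rowMeans(matrix(per_feature, nrow = nrow(data)))
}

data <- data.frame(
  x = c(0, 5, 10),
  g = factor(c("a", "b", "a")),
  flag = c(TRUE, TRUE, FALSE)
)
d <- gower_to_rows(data, query = data[1, ])
# d is c(0, 0.5, 2/3): row 1 is the query itself, row 2 differs in x and g,
# row 3 differs in x and flag
```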
The distance parameter controls which metric to use:
"auto" (default): Automatically selects Euclidean for all-numeric features, Gower otherwise
"euclidean": Forces Euclidean distance (errors if non-numeric features are present)
"gower": Forces Gower distance (works with any feature types)
Advantages:
Very fast (no model training)
Works with any feature types
Automatic distance metric selection
Naturally respects local data structure
Limitations:
Sensitive to the choice of k
The full task data is required for prediction
Can produce duplicates if k is small
May not extrapolate well to new regions
References
Little R, Rubin D (2019). Statistical Analysis with Missing Data, 3rd edition. John Wiley & Sons, Hoboken, NJ. ISBN 9780470526798.
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman R (2001). “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics, 17(6), 520–525. doi:10.1093/bioinformatics/17.6.520.
Super classes
xplainfi::FeatureSampler -> xplainfi::ConditionalSampler -> ConditionalKNNSampler
Public fields
feature_types (character()) Feature types supported by the sampler.
Methods
Method new()
Creates a new ConditionalKNNSampler.
Usage
ConditionalKNNSampler$new(task, conditioning_set = NULL, k = 5L)
Arguments
task (mlr3::Task) Task to sample from.
conditioning_set (character | NULL) Default conditioning set to use in $sample().
k (integer(1): 5L) Number of nearest neighbors to sample from.
Method sample()
Sample features from their kNN-based conditional distribution.
Arguments
feature (character()) Feature name(s) to sample.
row_ids (integer() | NULL) Row IDs from task to use as conditioning values.
conditioning_set (character() | NULL) Features to condition on. If NULL, samples from the marginal distribution (random sampling from training data).
k (integer(1) | NULL) Number of neighbors. If NULL, uses the stored parameter.
Method sample_newdata()
Sample from external data conditionally.
Arguments
feature (character()) Feature(s) to sample.
newdata (data.table) External data to use.
conditioning_set (character() | NULL) Features to condition on.
k (integer(1) | NULL) Number of neighbors. If NULL, uses the stored parameter.
Examples
library(mlr3)
library(xplainfi)

# Simulate a regression task with 100 observations
task = tgen("friedman1")$generate(n = 100)
sampler = ConditionalKNNSampler$new(task, k = 5L)

# Sample two features conditional on "important1" for the first five rows
test_data = task$data(rows = 1:5)
sampled = sampler$sample_newdata(
  feature = c("important2", "important3"),
  newdata = test_data,
  conditioning_set = "important1"
)