library(tidyverse)
library(PracTools)
library(ranger)
library(neuralnet)
library(formattable)
library(mlr3)
library(mlr3learners)
library(mcboost)
This vignette presents two typical use cases of MCBoost with data from a health survey. The goal is to post-process two initial prediction models for multi-accuracy using different flavors of MCBoost, and to eventually compare the naive and post-processed predictors overall and for subpopulations. The first scenario starts with a neural net and, as an example, evaluates the initial and post-processed predictors with a focus on subgroup accuracy after running MCBoost. The second scenario uses a random forest and evaluates the initial and post-processed predictors with respect to subgroup calibration.
We use data derived from the National Health Interview Survey (NHIS 2003), which includes demographic and health-related variables for 21,588 individuals. This data can directly be included from the PracTools
package.
data(nhis.large)
We can obtain more information using:
?nhis.large
In the following, our outcome of interest is whether an individual is covered by any type of health insurance (notcov
, 1 = not covered, 0 = covered). We additionally prepare two sets of variables:
The second set of variables will not be used for training the initial prediction models, but will be our focus when it comes to evaluating prediction performance for subgroups.
Before we training an initial model, we preprocess the data:
factor
.NA
s in categorical features to a dedicated factor levelNA
s in the outcome variable notcov
notcov
as a factor variable instead of a dummy variable (1 = notcov
, 0 = cov
)inv_wt
as the inverse of survey weights svwyt
categorical <- c("age.grp", "parents", "educ", "inc.grp", "doing.lw",
"limited", "sex", "hisp", "race")
nhis <- nhis.large %>%
mutate_at(categorical, as.factor) %>%
mutate_at(categorical, fct_explicit_na) %>%
drop_na(notcov) %>%
select(all_of(categorical), notcov, svywt, ID)
nhis$notcov <- factor(ifelse(nhis$notcov == 1, "notcov", "cov"))
nhis_enc <- data.frame(model.matrix(notcov ~ ., data = nhis)[,-1])
nhis_enc$notcov <- nhis$notcov
nhis_enc$sex <- nhis$sex
nhis_enc$hisp <- nhis$hisp
nhis_enc$race <- nhis$race
nhis_enc$inv_wt <- (1 / nhis$svywt)
The pre-processed NHIS data will be split into three datasets:
train
for training the initial prediction models (55 % of data)post
for post-processing the initial models with MCBoost (20 %)test
for model evaluation (25 %)To increase the difficulty of the prediction task, we sample from the NHIS data such that the prevalence of demographic subgroups in the test data differs from their prevalence in the training and auditing data. This is achieved by employing weighted sampling from NHIS (variable inv_wt
from above).
set.seed(2953)
test <- nhis_enc %>% slice_sample(prop = 0.25, weight_by = inv_wt)
nontest_g <- nhis_enc %>% anti_join(test, by = "ID")
train_g <- nontest_g %>% slice_sample(prop = 0.75)
post <- nontest_g %>% anti_join(train_g, by = "ID") %>% select(-ID, -svywt, -inv_wt, -c(sex:race))
train <- train_g %>% select(-ID, -svywt, -inv_wt, -c(sex:race), -c(sex2:race3))
As a result, non-hispanic white individuals (hisp2
) are overrepresented and hispanic individuals are underrepresented in both the training and auditing set, compared to their prevalence in the test set.
train_g %>% summarise_at(vars(sex2:race3), mean)
#> sex2 hisp2 hisp3 hisp4 race2 race3
#> 1 0.5118551 0.6252296 0.1416764 0.0399065 0.1477709 0.04449825
# hispanic individuals
1 - sum(train_g %>% summarise_at(vars(hisp2:hisp4), mean))
#> [1] 0.1931875
post %>% summarise_at(vars(sex2:race3), mean)
#> sex2 hisp2 hisp3 hisp4 race2 race3
#> 1 0.5224142 0.6278487 0.1402454 0.03806662 0.1460055 0.04282494
# hispanic individuals
1 - sum(post %>% summarise_at(vars(hisp2:hisp4), mean))
#> [1] 0.1938392
test %>% summarise_at(vars(sex2:race3), mean)
#> sex2 hisp2 hisp3 hisp4 race2 race3
#> 1 0.520759 0.4711629 0.1410859 0.03700921 0.1487883 0.04414804
# hispanic individuals
1 - sum(test %>% summarise_at(vars(hisp2:hisp4), mean))
#> [1] 0.3507421
We train an initial model for predicting healthcare coverage with the training set. Here, we use a neural network with one hidden layer, rather naively with little tweaking.
nnet <- neuralnet(notcov ~ .,
hidden = 5,
linear.output = FALSE,
err.fct = 'ce',
threshold = 0.5,
lifesign = 'full',
data = train
)
We prepare a function that allows us to pass the predictions of the model to MCBoost for post-processing.
init_nnet = function(data) {
predict(nnet, data)[, 2]
}
To showcase different use cases of MCBoost, we prepare two post-processing data sets based on the auditing set. The first set includes only the predictor variables that were used by the initial models, whereas the second set will allow post-processing based on our demographic subgroups of interest (sex, hispanic ethnicity, race).
d1 <- select(post, -c(notcov, sex2:race3))
d2 <- select(post, -notcov)
l <- 1 - one_hot(post$notcov)
We initialize two custom auditors for MCBoost: Ridge regression with a small penalty on model complexity, and a SubpopAuditorFitter
with a fixed set of subpopulations.
ridge = LearnerAuditorFitter$new(lrn("regr.glmnet", alpha = 0, lambda = 2 / nrow(post)))
pops = SubpopAuditorFitter$new(list("sex2", "hisp2", "hisp3", "hisp4", "race2", "race3"))
The ridge regression will only be given access to the initial predictor variables when post-processing the neural net predictions with the auditing data. In contrast, we guide the subpop-fitter to audit the initial predictions explicitly on the outlined subpopulations (sex, hispanic ethnicity, race). In summary, we have:
nnet
: Initial neural netnnet_mc_ridge
: Neural net, post-processed with ridge regression and the initial set of predictor variablesnnet_mc_subpop
: Neural net, post-processed with a fixed set of subpopulations
nnet_mc_ridge = MCBoost$new(init_predictor = init_nnet,
auditor_fitter = ridge,
multiplicative = TRUE,
partition = TRUE,
max_iter = 15)
nnet_mc_ridge$multicalibrate(d1, l)
nnet_mc_subpop = MCBoost$new(init_predictor = init_nnet,
auditor_fitter = pops,
partition = TRUE,
max_iter = 15)
nnet_mc_subpop$multicalibrate(d2, l)
Next, we use the initial and post-processed models to predict the outcome in the test data. We compute predicted probabilities and class predictions.
test$nnet <- predict(nnet, newdata = test)[, 2]
test$nnet_mc_ridge <- nnet_mc_ridge$predict_probs(test)
test$nnet_mc_subpop <- nnet_mc_subpop$predict_probs(test)
test$c_nnet <- round(test$nnet)
test$c_nnet_mc_ridge <- round(test$nnet_mc_ridge)
test$c_nnet_mc_subpop <- round(test$nnet_mc_subpop)
test$label <- 1 - one_hot(test$notcov)
Here we compare the overall accuracy of the initial and post-processed models. Overall, we observe little differences in performance.
mean(test$c_nnet == test$label)
#> [1] 0.8185234
mean(test$c_nnet_mc_ridge == test$label)
#> [1] 0.8237836
mean(test$c_nnet_mc_subpop == test$label)
#> [1] 0.8226564
However, we might be concerned with model performance for smaller subpopulations. In the following, we focus on subgroups defined by 2-way conjunctions of sex, hispanic ethnicity, and race.
test <- test %>%
group_by(sex, hisp) %>%
mutate(sex_hisp = cur_group_id()) %>%
group_by(sex, race) %>%
mutate(sex_race = cur_group_id()) %>%
group_by(hisp, race) %>%
mutate(hisp_race = cur_group_id()) %>%
ungroup()
grouping_vars <- c("sex", "hisp", "race", "sex_hisp", "sex_race", "hisp_race")
eval <- map(grouping_vars, group_by_at, .tbl = test) %>%
map(summarise,
'accuracy_nnet' = mean(c_nnet == label),
'accuracy_nnet_mc_ridge' = mean(c_nnet_mc_ridge == label),
'accuracy_nnet_mc_subpop' = mean(c_nnet_mc_subpop == label),
'size' = n()) %>%
bind_rows()
We evaluate classification accuracy on these subpopulations, and order the results according to the size of the selected subgroups (size
). Subgroup accuracy varies between methods, with MCBoost-Ridge (nnet_mc_ridge
) and MCBoost-Subpop (nnet_mc_subpop
) stabilizing subgroup performance when compared to the initial model, respectively.
eval %>%
arrange(desc(size)) %>%
select(size, accuracy_nnet:accuracy_nnet_mc_subpop) %>%
round(., digits = 3) %>%
formattable(., lapply(1:nrow(eval), function(row) {
area(row, col = 2:4) ~ color_tile("transparent", "lightgreen")
}))
size | accuracy_nnet | accuracy_nnet_mc_ridge | accuracy_nnet_mc_subpop |
---|---|---|---|
4296 | 0.816 | 0.820 | 0.818 |
2772 | 0.814 | 0.819 | 0.820 |
2551 | 0.824 | 0.829 | 0.826 |
2508 | 0.893 | 0.896 | 0.899 |
2508 | 0.893 | 0.896 | 0.899 |
2206 | 0.813 | 0.816 | 0.815 |
2090 | 0.820 | 0.824 | 0.822 |
1867 | 0.715 | 0.720 | 0.713 |
1788 | 0.709 | 0.713 | 0.705 |
1284 | 0.898 | 0.898 | 0.905 |
1224 | 0.888 | 0.894 | 0.893 |
967 | 0.702 | 0.709 | 0.698 |
900 | 0.729 | 0.732 | 0.729 |
792 | 0.837 | 0.850 | 0.852 |
751 | 0.836 | 0.847 | 0.850 |
751 | 0.836 | 0.847 | 0.850 |
440 | 0.823 | 0.839 | 0.850 |
416 | 0.822 | 0.837 | 0.846 |
352 | 0.855 | 0.864 | 0.855 |
335 | 0.854 | 0.860 | 0.854 |
235 | 0.796 | 0.809 | 0.800 |
197 | 0.782 | 0.797 | 0.787 |
197 | 0.782 | 0.797 | 0.787 |
126 | 0.794 | 0.802 | 0.802 |
109 | 0.798 | 0.817 | 0.798 |
105 | 0.781 | 0.790 | 0.790 |
92 | 0.783 | 0.804 | 0.783 |
41 | 0.854 | 0.902 | 0.902 |
38 | 0.868 | 0.868 | 0.868 |
In this scenario, we use a random forest with the default settings of the ranger package as the initial predictor.
rf <- ranger(notcov ~ ., data = train, probability = TRUE)
We again prepare a function to pass the predictions to MCBoost for post-processing.
init_rf = function(data) {
predict(rf, data)$prediction[, 2]
}
We use two custom auditors for MCBoost, i.e., ridge and lasso regression with different penalties on model complexity.
ridge = LearnerAuditorFitter$new(lrn("regr.glmnet", alpha = 0, lambda = 2 / nrow(post)))
lasso = LearnerAuditorFitter$new(lrn("regr.glmnet", alpha = 1, lambda = 40 / nrow(post)))
The ridge regression will only be given access to the initial predictor variables when post-processing the random forest predictions. In contrast, we allow the lasso regression to audit the initial predictions both with the initial predictors and the subpopulations (sex, hispanic ethnicity, race). In summary, we have:
rf
: Initial random forestrf_mc_ridge
: Random forest, post-processed with ridge regression and the initial set of predictor variablesrf_mc_lasso
: Random forest, post-processed with lasso regression and the extended set of predictors
rf_mc_ridge = MCBoost$new(init_predictor = init_rf,
auditor_fitter = ridge,
multiplicative = TRUE,
partition = TRUE,
max_iter = 15)
rf_mc_ridge$multicalibrate(d1, l)
rf_mc_lasso = MCBoost$new(init_predictor = init_rf,
auditor_fitter = lasso,
multiplicative = TRUE,
partition = TRUE,
max_iter = 15)
rf_mc_lasso$multicalibrate(d2, l)
We again compute predicted probabilities and class predictions using the initial and post-processed models.
test$rf <- predict(rf, test)$prediction[, 2]
test$rf_mc_ridge <- rf_mc_ridge$predict_probs(test)
test$rf_mc_lasso <- rf_mc_lasso$predict_probs(test)
test$c_rf <- round(test$rf)
test$c_rf_mc_ridge <- round(test$rf_mc_ridge)
test$c_rf_mc_lasso <- round(test$rf_mc_lasso)
Here we compare the overall accuracy of the initial and post-processed models. As before, we observe small differences in overall performance.
mean(test$c_rf == test$label)
#> [1] 0.8142025
mean(test$c_rf_mc_ridge == test$label)
#> [1] 0.8143904
mean(test$c_rf_mc_lasso == test$label)
#> [1] 0.8132632
However, we might be concerned with calibration in subpopulations. In the following we focus on subgroups defined by 2-way conjunctions of sex, hispanic ethnicity, and race.
eval <- map(grouping_vars, group_by_at, .tbl = test) %>%
map(summarise,
'bias_rf' = abs(mean(rf) - mean(label))*100,
'bias_rf_mc_ridge' = abs(mean(rf_mc_ridge) - mean(label))*100,
'bias_rf_mc_lasso' = abs(mean(rf_mc_lasso) - mean(label))*100,
'size' = n()) %>%
bind_rows()
This evaluation focuses on the difference between the average predicted risk of healthcare non-coverage and the observed proportion of non-coverage in the test data for subgroups. Considering the MCBoost-Ridge (rf_mc_ridge
) and MCBoost-Lasso (rf_mc_lasso
) results, post-processing with MCBoost reduces bias for many subpopulations.
eval %>%
arrange(desc(size)) %>%
select(size, bias_rf:bias_rf_mc_lasso) %>%
round(., digits = 3) %>%
formattable(., lapply(1:nrow(eval), function(row) {
area(row, col = 2:4) ~ color_tile("lightgreen", "transparent")
}))
size | bias_rf | bias_rf_mc_ridge | bias_rf_mc_lasso |
---|---|---|---|
4296 | 2.345 | 3.593 | 3.855 |
2772 | 0.445 | 1.637 | 2.518 |
2551 | 2.484 | 3.782 | 3.406 |
2508 | 4.428 | 2.809 | 1.457 |
2508 | 4.428 | 2.809 | 1.457 |
2206 | 1.386 | 2.589 | 3.486 |
2090 | 3.358 | 4.653 | 4.244 |
1867 | 11.116 | 11.865 | 10.622 |
1788 | 11.847 | 12.572 | 11.306 |
1284 | 5.199 | 3.742 | 1.840 |
1224 | 3.620 | 1.829 | 1.055 |
967 | 9.711 | 10.574 | 10.093 |
900 | 12.625 | 13.252 | 11.190 |
792 | 4.380 | 3.387 | 2.645 |
751 | 4.115 | 3.133 | 2.320 |
751 | 4.115 | 3.133 | 2.320 |
440 | 5.082 | 4.074 | 2.929 |
416 | 4.695 | 3.697 | 2.499 |
352 | 3.503 | 2.528 | 2.290 |
335 | 3.394 | 2.433 | 2.098 |
235 | 4.108 | 6.095 | 5.116 |
197 | 5.157 | 7.256 | 6.264 |
197 | 5.157 | 7.256 | 6.264 |
126 | 3.276 | 4.909 | 4.579 |
109 | 5.070 | 7.465 | 5.736 |
105 | 4.500 | 6.233 | 5.918 |
92 | 5.908 | 8.424 | 6.660 |
41 | 9.241 | 8.037 | 8.592 |
38 | 1.330 | 0.072 | 0.838 |