Data and Setup

This vignette presents two typical use cases of MCBoost with data from a health survey. The goal is to post-process two initial prediction models for multi-accuracy using different flavors of MCBoost, and to compare the initial and post-processed predictors both overall and for subpopulations. The first scenario starts with a neural network and evaluates the initial and post-processed predictors with a focus on subgroup accuracy. The second scenario uses a random forest and evaluates the initial and post-processed predictors with respect to subgroup calibration.

We use data derived from the National Health Interview Survey (NHIS 2003), which includes demographic and health-related variables for 21,588 individuals. These data can be loaded directly from the PracTools package.
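
Besides PracTools, the code in this vignette relies on a number of packages for data preparation, the initial models, and post-processing with MCBoost. A minimal setup could look as follows; the list is inferred from the functions used below and may need to be adjusted for your environment.

library(PracTools)    # nhis.large data
library(dplyr)        # data wrangling (mutate_at, select, slice_sample, ...)
library(tidyr)        # drop_na
library(forcats)      # fct_explicit_na
library(purrr)        # map
library(neuralnet)    # initial neural network (Scenario 1)
library(ranger)       # initial random forest (Scenario 2)
library(mlr3)         # lrn() for the glmnet-based auditors
library(mlr3learners) # regr.glmnet learner
library(mcboost)      # MCBoost and the auditor fitters
library(formattable)  # formatted result tables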

data(nhis.large)

We can obtain more information using:

?nhis.large

In the following, our outcome of interest is whether an individual is covered by any type of health insurance (notcov, 1 = not covered, 0 = covered). We additionally prepare two sets of variables:

  • Predictor variables (age, parents in household, education, income, employment status, physical or other limitations)
  • Subpopulation variables (sex, hispanic ethnicity, race)

The second set of variables will not be used for training the initial prediction models, but will be our focus when it comes to evaluating prediction performance for subgroups.

Before training an initial model, we preprocess the data:

  • We encode categorical features as factors.
  • We explicitly assign NAs in categorical features to a dedicated factor level.
  • We drop NAs in the outcome variable notcov.
  • We encode notcov as a factor variable instead of a dummy variable (1 = notcov, 0 = cov).
  • We create a new feature inv_wt as the inverse of the survey weights svywt.

categorical <- c("age.grp", "parents", "educ", "inc.grp", "doing.lw",
  "limited", "sex", "hisp", "race")

nhis <- nhis.large %>%
  mutate_at(categorical, as.factor) %>%
  mutate_at(categorical, fct_explicit_na) %>%
  drop_na(notcov) %>%
  select(all_of(categorical), notcov, svywt, ID)

nhis$notcov <- factor(ifelse(nhis$notcov == 1, "notcov", "cov"))

nhis_enc <- data.frame(model.matrix(notcov ~ ., data = nhis)[,-1])
nhis_enc$notcov <- nhis$notcov
nhis_enc$sex <- nhis$sex
nhis_enc$hisp <- nhis$hisp
nhis_enc$race <- nhis$race
nhis_enc$inv_wt <- (1 / nhis$svywt)

The pre-processed NHIS data will be split into three datasets:

  • A training set train for training the initial prediction models (about 56 % of the data)
  • An auditing set post for post-processing the initial models with MCBoost (about 19 %)
  • A test set test for model evaluation (25 %)

To increase the difficulty of the prediction task, we sample from the NHIS data such that the prevalence of demographic subgroups in the test data differs from their prevalence in the training and auditing data. This is achieved by employing weighted sampling from NHIS (variable inv_wt from above).

set.seed(2953)

test <- nhis_enc %>% slice_sample(prop = 0.25, weight_by = inv_wt)

nontest_g <- nhis_enc %>% anti_join(test, by = "ID")

train_g <- nontest_g %>% slice_sample(prop = 0.75)

post <- nontest_g %>% anti_join(train_g, by = "ID") %>% select(-ID, -svywt, -inv_wt, -c(sex:race))

train <- train_g %>% select(-ID, -svywt, -inv_wt, -c(sex:race), -c(sex2:race3))
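
As a quick check (not part of the original analysis), we can verify that the realized split proportions are close to the intended 56/19/25 split:

nrow(train) / nrow(nhis_enc)  # expected ~ 0.56
nrow(post)  / nrow(nhis_enc)  # expected ~ 0.19
nrow(test)  / nrow(nhis_enc)  # expected 0.25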

As a result, non-hispanic white individuals (hisp2) are overrepresented and hispanic individuals are underrepresented in both the training and auditing set, compared to their prevalence in the test set.

train_g %>% summarise_at(vars(sex2:race3), mean)
#>        sex2     hisp2     hisp3     hisp4     race2      race3
#> 1 0.5118551 0.6252296 0.1416764 0.0399065 0.1477709 0.04449825
# hispanic individuals
1 - sum(train_g %>% summarise_at(vars(hisp2:hisp4), mean))
#> [1] 0.1931875
post %>% summarise_at(vars(sex2:race3), mean)
#>        sex2     hisp2     hisp3      hisp4     race2      race3
#> 1 0.5224142 0.6278487 0.1402454 0.03806662 0.1460055 0.04282494
# hispanic individuals
1 - sum(post %>% summarise_at(vars(hisp2:hisp4), mean))
#> [1] 0.1938392
test %>% summarise_at(vars(sex2:race3), mean)
#>       sex2     hisp2     hisp3      hisp4     race2      race3
#> 1 0.520759 0.4711629 0.1410859 0.03700921 0.1487883 0.04414804
# hispanic individuals
1 - sum(test %>% summarise_at(vars(hisp2:hisp4), mean))
#> [1] 0.3507421

Scenario 1: Improve Subgroup Accuracy

We train an initial model for predicting healthcare coverage with the training set. Here, we use a neural network with a single hidden layer, trained rather naively with little tuning.

nnet <- neuralnet(notcov ~ .,
  hidden = 5,
  linear.output = FALSE,
  err.fct = 'ce',
  threshold = 0.5,
  lifesign = 'full',
  data = train
)

MCBoost Auditing

We prepare a function that allows us to pass the predictions of the model to MCBoost for post-processing.

init_nnet = function(data) {
  predict(nnet, data)[, 2]
}

To showcase different use cases of MCBoost, we prepare two post-processing data sets based on the auditing set. The first set includes only the predictor variables that were used by the initial models, whereas the second set will allow post-processing based on our demographic subgroups of interest (sex, hispanic ethnicity, race).

d1 <- select(post, -c(notcov, sex2:race3))
d2 <- select(post, -notcov)
l <- 1 - one_hot(post$notcov)
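
The vector l encodes the outcome for post-processing. Since the direction of the encoding is easy to mix up, a quick cross-tabulation against the original factor (a small sanity check, not part of the original analysis) should confirm that a value of 1 corresponds to individuals without coverage:

# cross-tabulate the 0/1 labels against the original outcome factor
table(labels = as.vector(l), notcov = post$notcov)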

We initialize two custom auditors for MCBoost: Ridge regression with a small penalty on model complexity, and a SubpopAuditorFitter with a fixed set of subpopulations.

ridge = LearnerAuditorFitter$new(lrn("regr.glmnet", alpha = 0, lambda = 2 / nrow(post)))

pops = SubpopAuditorFitter$new(list("sex2", "hisp2", "hisp3", "hisp4", "race2", "race3"))

The ridge regression will only be given access to the initial predictor variables when post-processing the neural net predictions with the auditing data. In contrast, we guide the subpop-fitter to audit the initial predictions explicitly on the outlined subpopulations (sex, hispanic ethnicity, race). In summary, we have:

  • nnet: Initial neural net
  • nnet_mc_ridge: Neural net, post-processed with ridge regression and the initial set of predictor variables
  • nnet_mc_subpop: Neural net, post-processed with a fixed set of subpopulations

nnet_mc_ridge = MCBoost$new(init_predictor = init_nnet,
                            auditor_fitter = ridge,
                            multiplicative = TRUE,
                            partition = TRUE,
                            max_iter = 15)
nnet_mc_ridge$multicalibrate(d1, l)

nnet_mc_subpop = MCBoost$new(init_predictor = init_nnet,
                             auditor_fitter = pops,
                             partition = TRUE,
                             max_iter = 15)
nnet_mc_subpop$multicalibrate(d2, l)

Model Evaluation

Next, we use the initial and post-processed models to predict the outcome in the test data. We compute predicted probabilities and class predictions.

test$nnet <- predict(nnet, newdata = test)[, 2]
test$nnet_mc_ridge <- nnet_mc_ridge$predict_probs(test)
test$nnet_mc_subpop <- nnet_mc_subpop$predict_probs(test)

test$c_nnet <- round(test$nnet)
test$c_nnet_mc_ridge <- round(test$nnet_mc_ridge)
test$c_nnet_mc_subpop <- round(test$nnet_mc_subpop)
test$label <- 1 - one_hot(test$notcov)

Here we compare the overall accuracy of the initial and post-processed models. Overall, we observe only small differences in performance.

mean(test$c_nnet == test$label)
#> [1] 0.8185234
mean(test$c_nnet_mc_ridge == test$label)
#> [1] 0.8237836
mean(test$c_nnet_mc_subpop == test$label)
#> [1] 0.8226564

However, we might be concerned with model performance for smaller subpopulations. In the following, we focus on subgroups defined by 2-way conjunctions of sex, hispanic ethnicity, and race.

test <- test %>%
  group_by(sex, hisp) %>%
  mutate(sex_hisp = cur_group_id()) %>%
  group_by(sex, race) %>%
  mutate(sex_race = cur_group_id()) %>%
  group_by(hisp, race) %>%
  mutate(hisp_race = cur_group_id()) %>%
  ungroup()

grouping_vars <- c("sex", "hisp", "race", "sex_hisp", "sex_race", "hisp_race")

eval <- map(grouping_vars, group_by_at, .tbl = test) %>%
  map(summarise,
      'accuracy_nnet' = mean(c_nnet == label),
      'accuracy_nnet_mc_ridge' = mean(c_nnet_mc_ridge == label),
      'accuracy_nnet_mc_subpop' = mean(c_nnet_mc_subpop == label),
      'size' = n()) %>%
  bind_rows()

We evaluate classification accuracy on these subpopulations, and order the results according to the size of the selected subgroups (size). Subgroup accuracy varies between methods, with both MCBoost-Ridge (nnet_mc_ridge) and MCBoost-Subpop (nnet_mc_subpop) stabilizing subgroup performance compared to the initial model.

eval %>%
  arrange(desc(size)) %>%
  select(size, accuracy_nnet:accuracy_nnet_mc_subpop) %>%
  round(., digits = 3) %>%
  formattable(., lapply(1:nrow(eval), function(row) {
    area(row, col = 2:4) ~ color_tile("transparent", "lightgreen")
  }))
size accuracy_nnet accuracy_nnet_mc_ridge accuracy_nnet_mc_subpop
4296 0.816 0.820 0.818
2772 0.814 0.819 0.820
2551 0.824 0.829 0.826
2508 0.893 0.896 0.899
2508 0.893 0.896 0.899
2206 0.813 0.816 0.815
2090 0.820 0.824 0.822
1867 0.715 0.720 0.713
1788 0.709 0.713 0.705
1284 0.898 0.898 0.905
1224 0.888 0.894 0.893
967 0.702 0.709 0.698
900 0.729 0.732 0.729
792 0.837 0.850 0.852
751 0.836 0.847 0.850
751 0.836 0.847 0.850
440 0.823 0.839 0.850
416 0.822 0.837 0.846
352 0.855 0.864 0.855
335 0.854 0.860 0.854
235 0.796 0.809 0.800
197 0.782 0.797 0.787
197 0.782 0.797 0.787
126 0.794 0.802 0.802
109 0.798 0.817 0.798
105 0.781 0.790 0.790
92 0.783 0.804 0.783
41 0.854 0.902 0.902
38 0.868 0.868 0.868

Scenario 2: Improve Subgroup Calibration

In this scenario, we use a random forest with the default settings of the ranger package as the initial predictor.

rf <- ranger(notcov ~ ., data = train, probability = TRUE)

MCBoost Auditing

We again prepare a function to pass the predictions to MCBoost for post-processing.

init_rf = function(data) {
  predict(rf, data)$predictions[, 2]
}

We use two custom auditors for MCBoost, namely ridge and lasso regression with different penalties on model complexity.

ridge = LearnerAuditorFitter$new(lrn("regr.glmnet", alpha = 0, lambda = 2 / nrow(post)))

lasso = LearnerAuditorFitter$new(lrn("regr.glmnet", alpha = 1, lambda = 40 / nrow(post)))

The ridge regression will only be given access to the initial predictor variables when post-processing the random forest predictions. In contrast, we allow the lasso regression to audit the initial predictions both with the initial predictors and the subpopulations (sex, hispanic ethnicity, race). In summary, we have:

  • rf: Initial random forest
  • rf_mc_ridge: Random forest, post-processed with ridge regression and the initial set of predictor variables
  • rf_mc_lasso: Random forest, post-processed with lasso regression and the extended set of predictors

rf_mc_ridge = MCBoost$new(init_predictor = init_rf,
                          auditor_fitter = ridge,
                          multiplicative = TRUE,
                          partition = TRUE,
                          max_iter = 15)
rf_mc_ridge$multicalibrate(d1, l)

rf_mc_lasso = MCBoost$new(init_predictor = init_rf,
                          auditor_fitter = lasso,
                          multiplicative = TRUE,
                          partition = TRUE,
                          max_iter = 15)
rf_mc_lasso$multicalibrate(d2, l)

Model Evaluation

We again compute predicted probabilities and class predictions using the initial and post-processed models.

test$rf <- predict(rf, test)$predictions[, 2]
test$rf_mc_ridge <- rf_mc_ridge$predict_probs(test)
test$rf_mc_lasso <- rf_mc_lasso$predict_probs(test)

test$c_rf <- round(test$rf)
test$c_rf_mc_ridge <- round(test$rf_mc_ridge)
test$c_rf_mc_lasso <- round(test$rf_mc_lasso)

Here we compare the overall accuracy of the initial and post-processed models. As before, we observe small differences in overall performance.

mean(test$c_rf == test$label)
#> [1] 0.8142025
mean(test$c_rf_mc_ridge == test$label)
#> [1] 0.8143904
mean(test$c_rf_mc_lasso == test$label)
#> [1] 0.8132632

However, we might be concerned with calibration in subpopulations. In the following, we focus on subgroups defined by 2-way conjunctions of sex, hispanic ethnicity, and race.

eval <- map(grouping_vars, group_by_at, .tbl = test) %>%
  map(summarise,
      'bias_rf' = abs(mean(rf) - mean(label))*100,
      'bias_rf_mc_ridge' = abs(mean(rf_mc_ridge) - mean(label))*100,
      'bias_rf_mc_lasso' = abs(mean(rf_mc_lasso) - mean(label))*100,
      'size' = n()) %>%
  bind_rows()

This evaluation focuses on the difference between the average predicted risk of healthcare non-coverage and the observed proportion of non-coverage in the test data for subgroups. Considering the MCBoost-Ridge (rf_mc_ridge) and MCBoost-Lasso (rf_mc_lasso) results, post-processing with MCBoost reduces bias for many subpopulations.

eval %>%
  arrange(desc(size)) %>%
  select(size, bias_rf:bias_rf_mc_lasso) %>%
  round(., digits = 3) %>%
  formattable(., lapply(1:nrow(eval), function(row) {
    area(row, col = 2:4) ~ color_tile("lightgreen", "transparent")
  }))
size bias_rf bias_rf_mc_ridge bias_rf_mc_lasso
4296 2.345 3.593 3.855
2772 0.445 1.637 2.518
2551 2.484 3.782 3.406
2508 4.428 2.809 1.457
2508 4.428 2.809 1.457
2206 1.386 2.589 3.486
2090 3.358 4.653 4.244
1867 11.116 11.865 10.622
1788 11.847 12.572 11.306
1284 5.199 3.742 1.840
1224 3.620 1.829 1.055
967 9.711 10.574 10.093
900 12.625 13.252 11.190
792 4.380 3.387 2.645
751 4.115 3.133 2.320
751 4.115 3.133 2.320
440 5.082 4.074 2.929
416 4.695 3.697 2.499
352 3.503 2.528 2.290
335 3.394 2.433 2.098
235 4.108 6.095 5.116
197 5.157 7.256 6.264
197 5.157 7.256 6.264
126 3.276 4.909 4.579
109 5.070 7.465 5.736
105 4.500 6.233 5.918
92 5.908 8.424 6.660
41 9.241 8.037 8.592
38 1.330 0.072 0.838