When creating models, the range of expected outcomes is often just as important as the most likely outcome. For example, a prediction that a house will have a price of $250,000 +/- $10,000 has a vastly different interpretation than a prediction that a house will have a price of $250,000 +/- $50,000! Some models (like linear models) can output both point predictions and confidence intervals around each prediction (N.B. a confidence interval is actually different from a prediction interval), but other, often more powerful, models can only output point predictions.
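To make the difference between the two interval types concrete, here is a minimal sketch using base R's lm() and the built-in mtcars data. Both are chosen purely for illustration and are not part of the penguin example that follows. The confidence interval describes uncertainty in the average outcome, while the prediction interval describes uncertainty for a single new observation and is therefore wider.

```r
# fit a simple linear model on a built-in dataset (illustrative only)
fit <- lm(mpg ~ wt, data = mtcars)

# a hypothetical new observation: a car weighing 3,000 lbs
new_car <- data.frame(wt = 3)

# interval for the *mean* response at wt = 3
conf_int <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)

# interval for a *single new* response at wt = 3
pred_int <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)

# compare interval widths; the prediction interval is the wider of the two
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
c(confidence = conf_width, prediction = pred_width)
```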
This is where bootstrap resampling can help! Creating n resamples of the original dataset allows us to create n models, one for each resample. These models can then be used to predict on new data and create a distribution of expected outcomes for each prediction.
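The idea can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the package's implementation: it uses lm() and the built-in mtcars data as stand-ins, and it only captures variability between refitted models (it does not add any residual noise to the predictions).

```r
set.seed(1)

n_boots <- 200                   # number of bootstrap resamples (arbitrary choice)
new_car <- data.frame(wt = 3)    # hypothetical new observation to predict

# for each resample: draw rows with replacement, refit, predict on the new data
boot_preds <- vapply(
  seq_len(n_boots),
  function(i) {
    idx <- sample(nrow(mtcars), replace = TRUE)
    fit <- lm(mpg ~ wt, data = mtcars[idx, ])
    unname(predict(fit, newdata = new_car))
  },
  numeric(1)
)

# the spread of predictions approximates a distribution of expected outcomes
quantile(boot_preds, probs = c(0.025, 0.5, 0.975))
```

Each element of boot_preds comes from a different model, so the quantiles summarise how much the prediction moves when the training data changes.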
{workboots} is a tidy implementation of this solution written around the core function predict_boots(). Pass an untrained workflow object to predict_boots() to return a tibble of nested predictions for each observation.
Let’s work through a motivating example of predicting a penguin’s weight (body_mass_g) from other characteristics using the Palmer Penguins dataset.
library(tidymodels)

# load the Palmer Penguins data
data("penguins")

# drop rows with missing values
penguins <-
  penguins %>%
  drop_na()

penguins
#> # A tibble: 333 x 7
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen 36.7 19.3 193 3450
#> 5 Adelie Torgersen 39.3 20.6 190 3650
#> 6 Adelie Torgersen 38.9 17.8 181 3625
#> 7 Adelie Torgersen 39.2 19.6 195 4675
#> 8 Adelie Torgersen 41.1 17.6 182 3200
#> 9 Adelie Torgersen 38.6 21.2 191 3800
#> 10 Adelie Torgersen 34.6 21.1 198 4400
#> # ... with 323 more rows, and 1 more variable: sex <fct>
XGBoost is a powerful model architecture, but it can only generate point predictions. To generate single estimates for each penguin’s weight, we can create and fit a workflow (some useful resources for how to use {tidymodels} include the tidymodels package site and the book Tidy Modeling with R).
# split data into training and testing sets
set.seed(123)
penguins_split <- initial_split(penguins)
penguins_test <- testing(penguins_split)
penguins_train <- training(penguins_split)

# create a workflow
penguins_wf <-
  workflow() %>%

  # add preprocessing steps
  add_recipe(
    recipe(body_mass_g ~ ., data = penguins_train) %>%
      step_dummy(all_nominal_predictors())
  ) %>%

  # add xgboost model specification
  add_model(
    boost_tree(mode = "regression") %>% set_engine("xgboost")
  )

# fit to training data & predict on test data
set.seed(234)
penguins_preds <-
  penguins_wf %>%
  fit(penguins_train) %>%
  predict(penguins_test)

As mentioned above, XGBoost models only generate point predictions.
# plot predicted vs. actual body mass; the dashed line marks perfect predictions
penguins_preds %>%
  bind_cols(penguins_test) %>%
  ggplot(aes(x = body_mass_g,
             y = .pred)) +
  geom_point() +
  geom_segment(aes(x = 3000, xend = 6000,
                   y = 3000, yend = 6000),
               linetype = "dashed",
               color = "gray") +
  labs(title = "Single XGBoost Model Predictions")

{workboots} to Generate Prediction Intervals

With {workboots}, however, we can generate a prediction distribution for each penguin’s weight in the test set! To do so, we’ll pass our workflow to predict_boots(), which will return a nested tibble with a set of predictions for each penguin in the penguins_test set.
library(workboots)

# create 100 models from bootstrap resamples and make predictions on the test set
set.seed(345)
penguins_preds_boot <-
  penguins_wf %>%
  predict_boots(
    n = 100,
    training_data = penguins_train,
    new_data = penguins_test
  )

penguins_preds_boot
#> # A tibble: 84 x 2
#> rowid .preds
#> <int> <list>
#> 1 1 <tibble [100 x 2]>
#> 2 2 <tibble [100 x 2]>
#> 3 3 <tibble [100 x 2]>
#> 4 4 <tibble [100 x 2]>
#> 5 5 <tibble [100 x 2]>
#> 6 6 <tibble [100 x 2]>
#> 7 7 <tibble [100 x 2]>
#> 8 8 <tibble [100 x 2]>
#> 9 9 <tibble [100 x 2]>
#> 10 10 <tibble [100 x 2]>
#> # ... with 74 more rows
From each set of nested predictions, we can summarise a point prediction along with the lower and upper bounds of a prediction interval (summarise_predictions() uses the quantile() function under the hood).
penguins_preds_boot %>%
  summarise_predictions()
#> # A tibble: 84 x 5
#> rowid .preds .pred_lower .pred .pred_upper
#> <int> <list> <dbl> <dbl> <dbl>
#> 1 1 <tibble [100 x 2]> 3296. 3469. 3799.
#> 2 2 <tibble [100 x 2]> 3307. 3528. 3825.
#> 3 3 <tibble [100 x 2]> 3369. 3617. 3913.
#> 4 4 <tibble [100 x 2]> 3799. 4129. 4492.
#> 5 5 <tibble [100 x 2]> 3662. 3899. 4102.
#> 6 6 <tibble [100 x 2]> 3258. 3522. 3819.
#> 7 7 <tibble [100 x 2]> 3281. 3450. 3582.
#> 8 8 <tibble [100 x 2]> 3736. 4073. 4340.
#> 9 9 <tibble [100 x 2]> 3221. 3453. 3616.
#> 10 10 <tibble [100 x 2]> 3195. 3388. 3611.
#> # ... with 74 more rows
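As a rough picture of what a quantile-based summary involves, the base R sketch below collapses each vector of bootstrap predictions into a lower bound, a point estimate, and an upper bound. The data are simulated stand-ins, summarise_one() is a hypothetical helper, and using the median as the point estimate is an assumption made for illustration; none of this is taken from the package's source.

```r
set.seed(42)

# hypothetical stand-in for nested predictions:
# 3 observations, each with 100 bootstrap predictions of body mass (g)
boot_preds <- list(
  rnorm(100, mean = 3500, sd = 150),
  rnorm(100, mean = 4100, sd = 200),
  rnorm(100, mean = 5200, sd = 180)
)

# hypothetical helper: summarise one vector of predictions with quantiles
summarise_one <- function(preds, conf = 0.95) {
  alpha <- (1 - conf) / 2
  q <- quantile(preds, probs = c(alpha, 0.5, 1 - alpha), names = FALSE)
  data.frame(.pred_lower = q[1], .pred = q[2], .pred_upper = q[3])
}

# one summary row per observation
summary_tbl <- do.call(rbind, lapply(boot_preds, summarise_one))
summary_tbl <- cbind(rowid = seq_along(boot_preds), summary_tbl)
summary_tbl
```

With conf = 0.95 the bounds are the 2.5% and 97.5% quantiles of each observation's predictions.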
This allows us to include a prediction interval along with our point predictions!
# plot predicted vs. actual body mass with bootstrap prediction intervals
penguins_preds_boot %>%
  summarise_predictions() %>%
  bind_cols(penguins_test) %>%
  ggplot(aes(x = body_mass_g,
             y = .pred,
             ymin = .pred_lower,
             ymax = .pred_upper)) +
  geom_abline(linetype = "dashed",
              color = "gray") +
  geom_errorbar(alpha = 0.5) +
  geom_point(alpha = 0.5) +
  labs(title = "XGBoost Model Prediction Intervals from Bootstrap Resampling",
       subtitle = "Error bars represent the 2.5/97.5% quantiles")