Type: Package
Title: Interpretable Boosted Linear Models
Version: 1.0.1
Description: Implements Interpretable Boosted Linear Models (IBLMs). These combine a conventional generalized linear model (GLM) with a machine learning component, such as XGBoost. The package also provides tools within for explaining and analyzing these models. For more details see Gawlowski and Wang (2025) https://ifoa-adswp.github.io/IBLM/reference/figures/iblm_paper.pdf.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Depends: R (≥ 4.1.0)
RoxygenNote: 7.3.0
Imports: cli, dplyr, fastDummies, ggExtra, ggplot2, purrr, scales, statmod, stats, utils, withr, xgboost
Suggests: testthat (≥ 3.0.0)
Config/testthat/edition: 3
URL: https://ifoa-adswp.github.io/IBLM/, https://github.com/IFoA-ADSWP/IBLM
BugReports: https://github.com/IFoA-ADSWP/IBLM/issues
NeedsCompilation: no
Packaged: 2025-11-15 20:02:43 UTC; paulb
Author: Karol Gawlowski [aut, cre, cph], Paul Beard [aut]
Maintainer: Karol Gawlowski <Karol.Gawlowski@citystgeorges.ac.uk>
Repository: CRAN
Date/Publication: 2025-11-19 19:50:25 UTC

Density Plot of Beta Corrections for a Variable

Description

Generates a density plot showing the distribution of corrected Beta values to a GLM coefficient, along with the original Beta coefficient, and standard error bounds around it.

NOTE This function signature documents the interface of functions created by create_beta_corrected_density.

Usage

beta_corrected_density(varname, q = 0.05, type = "kde")

Arguments

varname

Character string specifying the variable name OR coefficient name is accepted as well.

q

Number, must be between 0 and 0.5. Determines the quantile range of the plot (i.e. value of 0.05 will only show shaps within 5pct –> 95pct quantile range for plot)

type

Character string, must be "kde" or "hist"

Details

The plot shows:

Value

ggplot object(s) showing the density distribution of corrected beta coefficients with vertical lines indicating the original coefficient value and standard error bounds.

The item returned will be:

See Also

create_beta_corrected_density, explain_iblm

Examples

# This function is created inside explain_iblm() and is output as an item

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

explain_objects <- explain_iblm(iblm_model, df_list$test)

# plot can be for a categorical variable (produces list of plots, one for each level)
explain_objects$beta_corrected_density(varname = "Area")

# plot can be for a single categorical level
explain_objects$beta_corrected_density(varname = "AreaB")

# output can be numerical variable
explain_objects$beta_corrected_density(varname = "DrivAge")


# This function must be created, and cannot be called directly from the package
try(
beta_corrected_density(varname = "DrivAge")
)

Scatter Plot of Beta Corrections for a Variable

Description

Generates a scatter plot or boxplot showing SHAP corrections for a specified variable from a fitted model. For numerical variables, creates a scatter plot with optional coloring and marginal densities. For categorical variables, creates a boxplot with model coefficients overlaid.

NOTE This function signature documents the interface of functions created by create_beta_corrected_scatter.

Usage

beta_corrected_scatter(varname, q = 0, color = NULL, marginal = FALSE)

Arguments

varname

Character. Name of the variable to plot SHAP corrections for. Must be present in the fitted model.

q

Numeric. Quantile threshold for outlier removal. When 0 (default) the function will not remove any outliers

color

Character or NULL. Name of variable to use for point coloring. Must be present in the model. Currently not supported for categorical variables.

marginal

Logical. Whether to add marginal density plots (numerical variables only).

Details

The function handles both numerical and categorical variables differently:

For numerical variables, horizontal lines show the model coefficient (solid) and confidence intervals (dashed). SHAP corrections represent local deviations from the global model coefficient.

Value

A ggplot2 object. For numerical variables: scatter plot with SHAP corrections, model coefficient line, and confidence bands. For categorical variables: boxplot with coefficient points overlaid.

See Also

create_beta_corrected_scatter, explain_iblm

Examples

# This function is created inside explain_iblm() and is output as an item

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

explain_objects <- explain_iblm(iblm_model, df_list$test)

# plot can be for a categorical variable (box plot)
explain_objects$beta_corrected_scatter(varname = "Area")

# plot can be for a numerical variable (scatter plot)
explain_objects$beta_corrected_scatter(varname = "DrivAge")


# This function must be created, and cannot be called directly from the package
try(
beta_corrected_scatter(varname = "DrivAge")
)

Compute Beta Corrections based on SHAP values

Description

Processes SHAP values in one-hot (wide) format to create beta coefficient corrections.

This includes:

Usage

beta_corrections_derive(
  shap_wide,
  wide_input_frame,
  iblm_model,
  migrate_reference_to_bias = TRUE
)

Arguments

shap_wide

Data frame containing SHAP values from XGBoost that have been converted to wide format by [shap_to_onehot()]

wide_input_frame

Wide format input data frame (one-hot encoded).

iblm_model

Object of class 'iblm'

migrate_reference_to_bias

Logical, migrate the beta corrections to the bias for reference levels? This applied to categorical vars only. It is recommended to leave this setting on TRUE

Value

A data frame with the booster model beta corrections in one-hot (wide) format

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

shap <- stats::predict(
  iblm_model$booster_model,
  newdata = xgboost::xgb.DMatrix(
    data.matrix(
      dplyr::select(df_list$test, -dplyr::all_of(iblm_model$response_var))
    )
  ),
  predcontrib = TRUE
) |> data.frame()

wide_input_frame <- data_to_onehot(df_list$test, iblm_model)

shap_wide <- shap_to_onehot(shap, wide_input_frame, iblm_model)

beta_corrections <- beta_corrections_derive(shap_wide, wide_input_frame, iblm_model)

beta_corrections |> dplyr::glimpse()


Density Plot of Bias Corrections from SHAP values

Description

Visualizes the distribution of SHAP corrections that are migrated to bias terms, showing both per-variable and total bias corrections.

NOTE This function signature documents the interface of functions created by create_bias_density.

Usage

bias_density(q = 0, type = "hist")

Arguments

q

Numeric value between 0 and 0.5 for quantile bounds. A higher number will trim more from the edges (useful if outliers are distorting your plot window) Default is 0 (i.e. no trimming)

type

Character string specifying plot type: "kde" for kernel density or "hist" for histogram. Default is "hist".

Value

A list with two ggplot objects:

See Also

create_bias_density, explain_iblm

Examples

# This function is created inside explain_iblm() and is output as an item

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

explain_objects <- explain_iblm(iblm_model, df_list$test)

explain_objects$bias_density()


# This function must be created, and cannot be called directly from the package
try(
bias_density()
)

Check Data Variability for Modeling

Description

Validates that the response variable and all predictor variables have more than one unique value.

Usage

check_data_variability(data, response_var)

Arguments

data

A data frame containing the variables to check.

response_var

Character string naming the response variable in 'data'.

Value

Invisibly returns 'TRUE' if all checks pass, otherwise throws an error.


Check Object of Class 'iblm'

Description

Validates an iblm model object has required structure and features

Usage

check_iblm_model(model, booster_models_supported = c("xgb.Booster"))

Arguments

model

Model object to validate, expected class "iblm"

booster_models_supported

Booster model classes currently supported in the iblm package

Value

Invisible TRUE if all checks pass

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

check_iblm_model(iblm_model)


Plot GLM vs IBLM Predictions with Different Corridors

Description

Creates a faceted scatter plot comparing GLM predictions to ensemble predictions across different trim values, showing how the ensemble corrects the base GLM model.

Usage

correction_corridor(
  iblm_model,
  data,
  trim_vals = c(NA_real_, 4, 1, 0.2, 0.1, 0),
  sample_perc = 0.2,
  color = NA,
  ...
)

Arguments

iblm_model

An IBLM model object of class "iblm".

data

Data frame. If you have used 'split_into_train_validate_test()' this will usually be the "test" portion of your data.

trim_vals

Numeric vector of trim values to compare. The length of this vector will dictate the no. of facets shown in plot output

sample_perc

Proportion of data to randomly sample for plotting (0 to 1). Default is 0.2 to improve performance with large datasets

color

Optional. Name of a variable in 'data' to color points by

...

Additional arguments passed to 'geom_point()'

Value

A ggplot object showing GLM vs IBLM predictions faceted by trim value. The diagonal line (y = x) represents perfect agreement between models

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

correction_corridor(iblm_model = iblm_model, data = df_list$test, color = "DrivAge")


Create Pre-Configured Beta Corrected Density Plot Function

Description

Factory function that returns a plotting function with data pre-configured.

Usage

create_beta_corrected_density(
  wide_input_frame,
  beta_corrections,
  data,
  iblm_model
)

Arguments

wide_input_frame

Dataframe. Wide format input data (one-hot encoded).

beta_corrections

Dataframe. Output from beta_corrections_derive.

data

Dataframe. The testing data.

iblm_model

Object of class 'iblm'.

Value

Function with signature function(varname, q = 0.05, type = "kde").

See Also

[beta_corrected_density()]

Examples

# ------- prepare iblm objects required -------

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

test_data <- df_list$test
shap <- extract_booster_shap(iblm_model$booster_model, test_data)
wide_input_frame <- data_to_onehot(test_data, iblm_model)
shap_wide <- shap_to_onehot(shap, wide_input_frame, iblm_model)
beta_corrections <- beta_corrections_derive(shap_wide, wide_input_frame, iblm_model)

# ------- demonstration of functionality -------

# create_beta_corrected_density() can create function of type 'beta_corrected_density'
my_beta_corrected_density <- create_beta_corrected_density(
  wide_input_frame,
  beta_corrections,
  test_data,
  iblm_model
)

# this custom function then acts as per beta_corrected_density()
my_beta_corrected_density(varname = "VehAge")


Create Pre-Configured Beta Corrected Scatter Plot Function

Description

Factory function that returns a plotting function with data pre-configured.

Usage

create_beta_corrected_scatter(data_beta_coeff, data, iblm_model)

Arguments

data_beta_coeff

Dataframe. Contains the corrected beta coefficients for each row of the data.

data

Dataframe. The testing data.

iblm_model

Object of class 'iblm'.

Value

Function with signature function(varname, q = 0, color = NULL, marginal = FALSE).

See Also

[beta_corrected_scatter()]

Examples

# ------- prepare iblm objects required -------

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

test_data <- df_list$test
shap <- extract_booster_shap(iblm_model$booster_model, test_data)
wide_input_frame <- data_to_onehot(test_data, iblm_model)
shap_wide <- shap_to_onehot(shap, wide_input_frame, iblm_model)
beta_corrections <- beta_corrections_derive(shap_wide, wide_input_frame, iblm_model)
data_glm <- data_beta_coeff_glm(test_data, iblm_model)
data_booster <- data_beta_coeff_booster(test_data, beta_corrections, iblm_model)
data_beta_coeff <- data_glm + data_booster

# ------- demonstration of functionality -------

# create_beta_corrected_scatter() can create function of type 'beta_corrected_scatter'
my_beta_corrected_scatter <- create_beta_corrected_scatter(data_beta_coeff, test_data, iblm_model)

# this custom function then acts as per beta_corrected_scatter()
my_beta_corrected_scatter(varname = "VehAge")


Create Pre-Configured Bias Density Plot Function

Description

Factory function that returns a plotting function with data pre-configured.

Usage

create_bias_density(shap, data, iblm_model, migrate_reference_to_bias = TRUE)

Arguments

shap

Dataframe. Contains raw SHAP values.

data

Dataframe. The testing data.

iblm_model

Object of class 'iblm'.

migrate_reference_to_bias

TRUE/FALSE determines whether the shap values of categorical reference levels be migrated to the bias? Default is TRUE

Value

Function with signature function(q = 0, type = "hist").

See Also

[bias_density()]

Examples

# ------- prepare iblm objects required -------

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

test_data <- df_list$test
shap <- extract_booster_shap(iblm_model$booster_model, test_data)

# ------- demonstration of functionality -------

# create_bias_density() can create function of type 'bias_density'
my_bias_density <- create_bias_density(shap, test_data, iblm_model)

# this custom function then acts as per bias_density()
my_bias_density()



Create Pre-Configured Overall Correction Plot Function

Description

Factory function that returns a plotting function with data pre-configured.

Usage

create_overall_correction(shap, iblm_model)

Arguments

shap

Dataframe. Contains raw SHAP values.

iblm_model

Object of class 'iblm'.

Value

Function with signature function(transform_x_scale_by_link = TRUE).

See Also

[overall_correction()]

Examples

# ------- prepare iblm objects required -------

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

test_data <- df_list$test
shap <- extract_booster_shap(iblm_model$booster_model, test_data)

# ------- demonstration of functionality -------

# create_overall_correction() can create function of type 'overall_correction'
my_overall_correction <- create_overall_correction(shap, iblm_model)

# this custom function then acts as per overall_correction()
my_overall_correction()


Obtain Booster Model Beta Corrections for tabular data

Description

Creates dataframe of Shap beta corrections for each row and predictor variable of 'data'

Usage

data_beta_coeff_booster(data, beta_corrections, iblm_model)

Arguments

data

A data frame containing the dataset for analysis

beta_corrections

A data frame or matrix containing beta correction values for all variables and bias

iblm_model

Object of class 'iblm'

Value

A data frame with beta coefficient corrections. The structure will be the same dimension as 'data' except for a "bias" column at the start.

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

explainer_outputs <- explain_iblm(iblm_model, df_list$test)

data_booster <- data_beta_coeff_booster(
  df_list$test,
  explainer_outputs$beta_corrections,
  iblm_model
)

data_booster |> dplyr::glimpse()


Obtain GLM Beta Coefficients for tabular data

Description

Creates dataframe of GLM beta coefficients for each row and predictor variable of 'data'

Usage

data_beta_coeff_glm(data, iblm_model)

Arguments

data

Data frame with predictor variables

iblm_model

Object of class 'iblm'

Value

A data frame with beta coefficients. The structure will be the same dimension as 'data' except for a "bias" column at the start.

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

data_glm <- data_beta_coeff_glm(df_list$train, iblm_model)

data_glm |> dplyr::glimpse()


Convert Data Frame to Wide One-Hot Encoded Format

Description

Transforms categorical variables in a data frame into one-hot encoded format

Usage

data_to_onehot(data, iblm_model, remove_target = TRUE)

Arguments

data

Input data frame to be transformed. This will typically be the "train" data subset

iblm_model

Object of class 'iblm'

remove_target

Logical, whether to remove the response_var variable from the output (default TRUE).

Value

A data frame in wide format with one-hot encoded categorical variables, an intercept column, and all variables ordered according to "coeff_names$all" from 'iblm_model'

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

wide_input_frame <- data_to_onehot(df_list$test, iblm_model)

wide_input_frame |> dplyr::glimpse()


Explain GLM Model Predictions Using SHAP Values

Description

Creates a list that explains the beta values, and their corrections, of the ensemble IBLM model

Usage

explain_iblm(iblm_model, data, migrate_reference_to_bias = TRUE)

Arguments

iblm_model

An object of class 'iblm'. This should be output by 'train_iblm_xgb()'

data

Data frame.

If you have used 'split_into_train_validate_test()' this will be the "test" portion of your data.

migrate_reference_to_bias

Logical, migrate the beta corrections to the bias for reference levels? This applied to categorical vars only. It is recommended to leave this setting on TRUE

Details

The following outputs are functions that can be called to create plots:

For each of these, the key data arguments (e.g. data, shap, iblm_model) are already populated by 'explain_iblm()'. When calling these functions output from 'explain_iblm()' only key settings like variable names, colours...etc need populating.

Value

A list containing:

beta_corrected_scatter

Function to create scatter plots showing SHAP corrections vs variable values (see beta_corrected_scatter)

beta_corrected_density

Function to create density plots of SHAP corrections for variables (see beta_corrected_density)

bias_density

Function to create density plots of SHAP corrections migrated to bias (see bias_density)

overall_correction

Function to show global correction distributions (see overall_correction)

shap

Dataframe showing raw SHAP values of data records

beta_corrections

Dataframe showing beta corrections (in wide/one-hot format) of data records

data_beta_coeff

Dataframe showing beta coefficients of data records

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

ex <- explain_iblm(iblm_model, df_list$test)

# the output contains functions that can be called to visualise iblm
ex$beta_corrected_scatter("DrivAge")
ex$beta_corrected_density("DrivAge")
ex$overall_correction()
ex$bias_density()

# the output contains also dataframes
ex$shap |> dplyr::glimpse()
ex$beta_corrections |> dplyr::glimpse()
ex$data_beta_coeff |> dplyr::glimpse()


Extract SHAP values from an xgboost Booster model

Description

A function to extract SHAP (SHapley Additive exPlanations) values from fitted booster model

Usage

extract_booster_shap(booster_model, data, ...)

## S3 method for class 'xgb.Booster'
extract_booster_shap(booster_model, data, ...)

Arguments

booster_model

A model object. In the IBLM context it will be the "booster_model" item from an object of class "iblm"

data

A data frame containing the predictor variables. Note anything extra will be quietly dropped.

...

Additional arguments passed to methods.

Details

Currently only a booster_model of class 'xgb.Booster' is supported

Value

A data frame of SHAP values, where each column corresponds to a feature and each row corresponds to an observation.

Methods (by class)

See Also

[xgboost::predict.xgb.Booster()]

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

booster_shap <- extract_booster_shap(iblm_model$booster_model, df_list$test)

booster_shap |> dplyr::glimpse()


French Motor Insurance Claims Dataset

Description

A dataset containing information about French motor insurance policies and claims, commonly used for actuarial modeling and risk assessment studies.

This is a "mini" subset of the CASdatasets 'freMTPL2freq' data, with some manipulation (see details)

Usage

freMTPLmini

Format

A data frame with 25,000 rows and 8 variables:

ClaimRate

Number of claims made, at an annualised rate, rounded (integer)

VehPower

Vehicle power rating or engine horsepower category (integer)

VehAge

Age of the vehicle in years (integer)

DrivAge

Age of the driver in years (integer)

BonusMalus

Bonus-malus coefficient, a rating factor used in French insurance where lower values indicate better driving records (integer)

VehBrand

Vehicle brand/manufacturer code (factor with levels like B6, B12, etc.)

VehGas

Type of fuel used by the vehicle (factor with levels: Regular, Diesel)

Area

Area classification where the policy holder resides (factor with levels A through F)

Details

The dataset is a random sample of 50,000 records from 'freMTPL2freq' from the 'CASdatasets' pacakge. Other modifications applied are:

Source

['https://github.com/dutangc/CASdatasets/raw/c49cbbb37235fc49616cac8ccac32e1491cdc619/data/freMTPL2freq.rda']

Examples

head(freMTPLmini)


Calculate Pinball Scores for IBLM and Additional Models

Description

Computes Poisson deviance and pinball scores for an IBLM model alongside homogeneous, GLM, and optional additional models.

Usage

get_pinball_scores(
  data,
  iblm_model,
  trim = NA_real_,
  additional_models = list()
)

Arguments

data

Data frame. If you have used 'split_into_train_validate_test()' this will be the "test" portion of your data.

iblm_model

Fitted IBLM model object of class "iblm"

trim

Numeric trimming parameter for IBLM predictions. Default is 'NA_real_'.

additional_models

(Named) list of fitted models for comparison. These models MUST be fitted on the same data as 'iblm_model' for sensible results. If unnamed, models are labeled by their class.

Details

Pinball scores are calculated relative to a homogeneous model (i.e. a simple mean prediction of training data). Higher scores indicate better predictive performance.

Value

Data frame with 3 columns:

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

get_pinball_scores(data = df_list$test, iblm_model = iblm_model)


Load French Motor Third-Party Liability Frequency Dataset

Description

Downloads the French Motor Third-Party Liability (freMTPL2freq) dataset from the CASdatasets GitHub repository, and apply minor transformations.

Usage

load_freMTPL2freq()

Details

The function performs the following modifications to the sourced:

Value

A data frame containing the following columns:

ClaimNb

Number of claims per year.

VehPower

The power of the car (ordered categorical).

VehAge

The vehicle age in years, capped at 50.

DrivAge

The driver age in years (minimum 18, the legal driving age in France).

BonusMalus

Bonus/malus coefficient, ranging from 50 to 350. Values below 100 indicate a bonus (discount), while values above 100 indicate a malus (surcharge) in the French insurance system.

VehBrand

The car brand (categorical, with unknown/proprietary category labels).

VehGas

The car fuel type: "Diesel" or "Regular".

Area

The density classification of the city community where the driver lives, ranging from "A" (rural area) to "F" (urban centre).

Density

The population density (inhabitants per square kilometer) of the city where the driver lives.

Region

The policy region in France, based on the 1970-2015 administrative classification.

Note

This function requires an internet connection to download the data.

References

Dutang, C. CASdatasets: Insurance datasets. https://github.com/dutangc/CASdatasets/raw/c49cbbb37235fc49616cac8ccac32e1491cdc619/data/freMTPL2freq.rda

Examples


# Load the preprocessed dataset
freMTPL2freq <- load_freMTPL2freq()

freMTPL2freq |> dplyr::glimpse()



Plot Overall Corrections from Booster Component

Description

Creates a visualization showing for each record the overall booster component (either multiplicative or additive)

NOTE This function signature documents the interface of functions created by create_overall_correction.

Usage

overall_correction(transform_x_scale_by_link = TRUE)

Arguments

transform_x_scale_by_link

TRUE/FALSE, whether to transform the x axis by the link function

Value

A ggplot2 object.

See Also

create_overall_correction, explain_iblm

Examples

# This function is created inside explain_iblm() and is output as an item

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

explain_objects <- explain_iblm(iblm_model, df_list$test)

explain_objects$overall_correction()


# This function must be created, and cannot be called directly from the package
try(
overall_correction()
)

Predict Method for IBLM

Description

This function generates predictions from an ensemble model consisting of a GLM and an XGBoost model.

Usage

## S3 method for class 'iblm'
predict(object, newdata, trim = NA_real_, type = "response", ...)

Arguments

object

An object of class 'iblm'. This should be output by 'train_iblm_xgb()'

newdata

A data frame or matrix containing the predictor variables for which predictions are desired. Must have the same structure as the training data used to fit the 'iblm' model.

trim

Numeric value for post-hoc truncating of XGBoost predictions. If NA (default) then no trimming is applied.

type

string, defines the type argument used in GLM/Booster Currently only "response" is supported

...

additional arguments affecting the predictions produced.

Details

The prediction process involves the following steps:

  1. Generate GLM predictions

  2. Generate Booster predictions

  3. If trimming is specified, apply to booster predictions

  4. Combine GLM and Booster predictions as per "relationship" described within iblm model object

At this point, only an iblm model with a "booster_model" object of class 'xgb.Booster' is supported

Value

A numeric vector of ensemble predictions computed as the element-wise product of GLM response probabilities and (optionally trimmed) XGBoost predictions.

See Also

predict.glm, predict.xgb.Booster

Examples

data <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  data,
  response_var = "ClaimRate",
  family = "poisson"
)

predictions <- predict(iblm_model, data$test)

predictions |> dplyr::glimpse()


Convert Shap values to Wide One-Hot Encoded Format

Description

Transforms categorical variables in a data frame into one-hot encoded format. Renames "BIAS" to lowercase.

Usage

shap_to_onehot(shap, wide_input_frame, iblm_model)

Arguments

shap

Data frame containing raw SHAP values from XGBoost.

wide_input_frame

Wide format input data frame (one-hot encoded).

iblm_model

Object of class 'iblm'

Value

A data frame where SHAP values are in wide format for categorical variables. Column "bias" is moved to start.

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)

shap <- stats::predict(
  iblm_model$booster_model,
  newdata = xgboost::xgb.DMatrix(
    data.matrix(
      dplyr::select(df_list$test, -dplyr::all_of(iblm_model$response_var))
    )
  ),
  predcontrib = TRUE
) |> data.frame()

wide_input_frame <- data_to_onehot(df_list$test, iblm_model)

shap_wide <- shap_to_onehot(shap, wide_input_frame, iblm_model)

shap_wide |> dplyr::glimpse()


Split Dataframe into: 'train', 'validate', 'test'

Description

This function randomly splits a data frame into three subsets for machine learning workflows: training, validation, and test sets. The proportions can be customized and must sum to 1.

Usage

split_into_train_validate_test(
  df,
  train_prop = 0.7,
  validate_prop = 0.15,
  test_prop = 0.15,
  seed = NULL
)

Arguments

df

A data frame to be split into subsets.

train_prop

A numeric value between 0 and 1 specifying the proportion of data to allocate to the training set.

validate_prop

A numeric value between 0 and 1 specifying the proportion of data to allocate to the validation set.

test_prop

A numeric value between 0 and 1 specifying the proportion of data to allocate to the test set.

seed

(optional) a numeric value to set the random no. seed within function environment.

Details

The function assigns each row to either "train", "validate" or "test" with the probability defined in the function.

Because each row is assigned a bucket independently, for very small datasets the proportions may not be as desired. This should not be an issue as data used for 'iblm' must be reasonably large.

Value

A named list with three elements:

train

A data frame containing the training subset

validate

A data frame containing the validation subset

test

A data frame containing the test subset

Examples

# Using 'mtcars'
split_into_train_validate_test(
  mtcars,
  train_prop = 0.6,
  validate_prop = 0.2,
  test_prop = 0.2,
  seed = 9000
)


Custom ggplot2 Theme for IBLM

Description

Custom ggplot2 Theme for IBLM

Usage

theme_iblm()

Value

A ggplot2 theme object that can be added to plots.


Train IBLM Model on XGBoost

Description

This function trains an interpretable boosted linear model.

The function combines a Generalized Linear Model (GLM) with a booster model of XGBoost

The "booster" model is trained on: - actual responses / GLM predictions, when the link function is log - actual responses - GLM predictions, when the link function is identity

Usage

train_iblm_xgb(
  df_list,
  response_var,
  family = "poisson",
  params = list(),
  nrounds = 1000,
  obj = NULL,
  feval = NULL,
  verbose = 0,
  print_every_n = 1L,
  early_stopping_rounds = 25,
  maximize = NULL,
  save_period = NULL,
  save_name = "xgboost.model",
  xgb_model = NULL,
  callbacks = list(),
  ...,
  strip_glm = TRUE
)

Arguments

df_list

A named list containing training and validation datasets. Must have elements named "train" and "validate", each containing df_list frames with the same structure. This item is naturally output from the function [split_into_train_validate_test()]

response_var

Character string specifying the name of the response variable column in the datasets. The string MUST appear in both 'df_list$train' and 'df_list$validate'.

family

Character string specifying the distributional family for the model. Currently only "poisson", "gamma", "tweedie" and "gaussian" is fully supported. See details for how this impacts fitting.

params

Named list of additional parameters to pass to xgb.train. Note that train_iblm_xgb will select "objective" and "base_score" for you depending on 'family' (see details section). However you may overwrite these (do so with caution)

nrounds, obj, feval, verbose, print_every_n, early_stopping_rounds, maximize, save_period, save_name, xgb_model, callbacks, ...

These are passed directly to xgb.train

strip_glm

TRUE/FALSE, whether to strip superfluous data from the 'glm_model' object saved within 'iblm' class that is output. Only serves to reduce memory constraints.

Details

The 'family' argument will be fed into the GLM fitting. Default values for the XGBoost fitting are also selected based on family.

Note: Any xgboost configuration below will be overwritten by any explicit arguments input via 'params'

For "poisson" family the link function is 'log' and XGBoost is configured with:

For "gamma" family the link function is 'log' and XGBoost is configured with:

For "tweedie" family the link function is 'log' (with a var.power = 1.5) and XGBoost is configured with:

For "gaussian" family the link function is 'identity' and XGBoost is configured with:

Value

An object of class "iblm" containing:

glm_model

The GLM model object, fitted on the 'df_list$train' data that was provided

booster_model

The booster model object, trained on the residuals leftover from the glm_model

data

A list containing the data that was used to train and validate this iblm model

relationship

String that explains how to combine the 'glm_model' and 'booster_model'. Currently only either "Additive" or "Multiplicative"

response_var

A string describing the response variable used for this iblm model

predictor_vars

A list describing the predictor variables used for this iblm model

cat_levels

A list describing the categorical levels for the predictor vars

coeff_names

A list describing the coefficient names

See Also

glm, xgb.train

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

iblm_model <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson"
)


Train XGBoost Model Using the IBLM Model Parameters

Description

Trains an XGBoost model using parameters extracted from the booster residual component of the iblm model. This is a convenient way to fit an XGBoost model for direct comparison with a fitted iblm model

Usage

train_xgb_as_per_iblm(iblm_model, ...)

Arguments

iblm_model

Ensemble model object of class "iblm" containing GLM and XGBoost model components. Also contains data that was used to train it.

...

optional arguments to insert into xgb.train(). Note this will cause deviation from the settings used for training 'iblm_model'

Value

Trained XGBoost model object (class "xgb.Booster").

See Also

xgb.train

Examples

df_list <- freMTPLmini |> split_into_train_validate_test(seed = 9000)

# training with plenty of rounds allowed
iblm_model1 <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson",
  max_depth = 6,
  nrounds = 1000
)

xgb1 <- train_xgb_as_per_iblm(iblm_model1)

# training with severe restrictions (expected poorer results)
iblm_model2 <- train_iblm_xgb(
  df_list,
  response_var = "ClaimRate",
  family = "poisson",
  max_depth = 1,
  nrounds = 5
)

xgb2 <- train_xgb_as_per_iblm(iblm_model2)

# comparison shows the poor training mirrored in second set:
get_pinball_scores(
  df_list$test,
  iblm_model1,
  trim = NA_real_,
  additional_models = list(iblm2 = iblm_model2, xgb1 = xgb1, xgb2 = xgb2)
)