The predRupdate package includes a set of functions
to aid in the validation of a clinical prediction model (CPM) on a given
dataset, and to apply various model updating and aggregation methods.
This vignette aims to overview the technical details of the methods that
are implemented within the predRupdate package. For an
introduction to using the package, please see
vignette("predRupdate")
.
Clinical prediction models (CPMs) are statistical models that aim to predict the presence (diagnostic models) or future occurrence (prognostic models) of an outcome of interest for an individual, using information (predictor variables) that are available about that individual at the time the prediction is made. For example, we might use someone’s age, sex, smoking status and family medical history to predict their risk of developing cardiovascular disease in the next ten years. These models can be used to aid clinical decision-making, form the cornerstone of decision support systems, and underpin clinical audit and feedback tasks.
The usual stages for the production of CPMs are: (i) model development and internal validation, (ii) external validation in new data, potentially with updating as needed, and (iii) impact assessment. These are summarised in the table below:
Phase | Description |
---|---|
(i) Model Development and internal validation | The development of the model uses a cohort of patients, where the predictor variables and outcomes are known, to perform variable selection and estimate the prognostic effect (coefficient) of each predictor on the outcome of interest. This is followed by assessment of predictive performance within that data, adjusting for in-sample optimism |
(ii) External validation | Aims to assess the predictive performance of the developed model in data that are different to those used to develop the model (e.g., temporally or geographically different). If performance is deemed unsuitable, the model may undergo updating to tailor it more closely to the new data |
(iii) Impact Assessment | Impact assessment investigates the extent to which the model changes clinical practice and improves patient outcomes |
The aim of predRupdate is to provide tools for stage (ii). Specifically, the package is intended for the situation where a CPM has already been developed (e.g., a model available in the literature), and one wishes to apply the model to a new dataset for validation, updating, or both. The package assumes that one has access to the reported model parameters (all those that are required to make a prediction of risk for a new observation) and an individual participant dataset in which one wishes to apply the model.
This vignette is not intended to be a detailed tutorial of prediction modelling. For that, we refer readers to Riley et al. 2019 and Steyerberg 2009. Instead, this vignette is intended to describe the technical details of how certain methodologies have been implemented in predRupdate.
The predictive performance of a CPM is summarised by its calibration,
discrimination and overall accuracy. We provide a brief overview of the
technical details that are implemented in predRupdate
in the pred_validate()
function, but refer readers to the
literature for a detailed overview (Riley et al. 2019; Steyerberg et
al. 2013; Moons et al. 2012; Altman and Royston 2000; Van Calster et
al. 2016; McLernon et al. 2023).
Calibration is the agreement of the predicted risks from the CPM with the observed risks in the validation dataset, across the full risk range. A primary method of assessing calibration is to produce a flexible calibration plot, which graphically depicts the estimated risk (x-axis) and the observed probabilities (y-axis). The observed probabilities are obtained by regressing the observed outcomes in the validation dataset against the linear predictor (calculated for each individual in the validation dataset), using loess or splines.
Specifically, suppose an existing (logistic regression) CPM has been developed (in another dataset) to predict the probability of a binary outcome, \(Y\). This CPM is given by
\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \hat{\beta}_0 + \hat{\beta}_1X_{i,1} + \hat{\beta}_2X_{i,2} + ... + \hat{\beta}_PX_{i,P} \]
where \(\pi_{i} = P(Y_i = 1)\), with \(\hat{\beta}_{0},...\hat{\beta}_{P}\) being the estimated set of regression coefficients (log odds ratios; taken from the original development of the model) and \(X_{i,p}\) the value of predictor variable \(p\) for individual \(i\). This linear combination of the regression coefficients and predictor variables is the linear predictor of the model.
To obtain a flexible calibration plot of this model within the
validation data, pred_validate()
fits the following model
to the validation data:
\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \alpha_0 + s(LP_i) \]
where \(s(.)\) is some smooth
function; in pred_validate()
this is a natural cubic
spline. This model is then used to obtain observed probabilities for
each individual in the validation dataset, which are plotted against the
corresponding predicted risks obtained from the CPM. A resulting curve
close to the diagonal (y=x line) indicates that predicted risks
correspond well to observed proportions. In
pred_validate()
, a histogram of the predicted risk
distribution is overlaid with the calibration plot, to visually
summarise the probability distribution.
If validating an existing time-to-event CPM,
pred_validate()
produces the flexible calibration plot
using methods similar to those described above for logistic regression
models. Here, the observed probabilities are obtained by fitting a Cox
proportional hazards model to regress the linear predictor of the model
(under natural cubic spline) against the time-to-event outcome in the
validation dataset. A time horizon must also be specified (i.e., the
time during follow-up in which to assess calibration). Such a plot can
only be produced for an existing time-to-event CPM if the cumulative
baseline hazard of the model is supplied.
Alongside flexible calibration plots, pred_validate()
also calculates the calibration slope. The calibration slope (ideal
value 1) indicates the level of over-fitting of the model, with values
less than 1 indicating more over-fitting. In
pred_validate()
this is estimated by fitting a logistic
regression model (if the existing CPM is a logistic model) or Cox
proportion hazards model (if the existing CPM is a survival model) to
the observed outcomes in the validation dataset with the linear
predictor from the CPM as the only covariate.
Finally, calibration-in-the-large summarises how close the mean predicted risk is to the mean outcome proportion in the validation dataset. If validating a logistic regression CPM, this is quantified using the calibration intercept, estimated using the same model as estimating the calibration slope, with the slope fixed at unity. Here, a calibration intercept less than 0 would indicate that the mean predicted risk is higher than the observed outcome proportion. If validating a time-to-event (survival) CPM, then calibration-in-the-large is quantified with the observed:expected ratio (ideal value 1) at a fixed time horizon. Here, the observed proportion is obtained using Kaplan-Meier estimate. Such a metric can only be produced for an existing time-to-event CPM if the cumulative baseline hazard of the model is supplied such that the predicted risks at the time horizon can be calculated.
Discrimination of a CPM is the ability of the model to differentiate
those who experience the outcome from those who do not; i.e., does the
model estimate a higher predicted risk, on average, for those who
experience the outcome, compared to those who do not experience the
outcome. For validating logistic regression CPMs,
pred_validate()
calculates the area under the receiver
operating characteristic curve (AUC) of the CPM. For validating
time-to-event (survival) CPMs, pred_validate()
calculates
Harrell’s C-statistic. An AUC/ C-statistic of 0.5 would indicate
discrimination no better than chance, whereas a value of 1 indicates
perfect calibration.
For validating a logistic regression model,
pred_validate()
also produces two additional metrics of
overall accuracy of the model. The first is the Cox-Snell and Nagelkerke
R-squared values. Here, higher R-squared values indicate better overall
performance of the model.
The Cox-Snell R-squared is estimated by
\[ R^{2}_{CS} = 1 - \exp\left(\frac{-LR}{n}\right) \]
where \(LR\) is the likelihood ratio statistic, and \(n\) is the size of the validation dataset. Specifically,
\[ LR = -2(L_{null} - L_{model}) \]
with \(L_{model}\) being the log-likelihood of the CPM on the validation dataset, and \(L_{null}\) is the log-likelihood of an intercept-only model in the validation dataset. Nagelkerke R-squared is then a scaled version of Cox-Snell R-squared, such that 1 is the ‘best’ value (unlike Cox-Snell R-squared).
Finally, the Brier score of the logistic regression CPM is also calculated, as the average squared difference between the observed outcome and the predicted risks from the model.
Upon validating an existing CPM in new data (external validation) it is not uncommon to find that the models predictive performance deteriorates. Rather than developing a new model, an alternative strategy is to apply model updating methods. See Moons et al. 2012 and Su et al. 2018 for a detailed overview of these methods. The extent of model updating depends on the issues identified during model validation.
If the model validation identifies that the CPM has poor
calibration-in-the-large, then a simple updating strategy is “intercept
update”. For a logistic regression CPM, pred_update()
fits
the following model in the new dataset using maximum likelihood
estimation:
\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \alpha_0 + LP_i = \alpha_0 + (\hat{\beta}_0 + \hat{\beta}_1X_{i,1} + ... + \hat{\beta}_PX_{i,P}) \]
Here, \(\alpha_0\) serves to update the intercept of the original model such that the mean predicted risk from the (updated) CPM matches the observed event proportion in the new dataset. Specifically, the new intercept will be \(\alpha_0 + \hat{\beta}_0\). All other terms (\(\beta_{1},...,\beta_{P}\)) remain the same as originally published.
Often, the original model will also demonstrate elements of
over-fitting (i.e., the calibration slope of the model is <1 in the
validation data). In such a case, the model can be re-calibrated in the
new dataset. For a logistic regression CPM, pred_update()
fits the following model in the new dataset using maximum likelihood
estimation:
\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \alpha_0 + \alpha_1LP_i = \alpha_0 + \alpha_1(\hat{\beta}_0 + \hat{\beta}_1X_{i,1} + ... + \hat{\beta}_PX_{i,P}) \]
This has the effect of multiplying each original model coefficient by \(\alpha_1\). Specifically, the new (updated) regression coefficients will be \(\alpha_1 \times \beta{p}\), and the new model intercept is \(\alpha_0 + (\alpha_1 \times \hat{\beta}_0)\).
Neither of the above methods alter the discrimination of the model within the new dataset. For this, one option is to refit the model to the new dataset. Here, one would fit the following model in the new dataset using maximum likelihood estimation:
\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \hat{\alpha}_0 + \hat{\alpha}_1X_{i,1} + \hat{\alpha}_2X_{i,2} + ... + \hat{\alpha}_PX_{i,P} \]
Here, \(\alpha_{0},...,\alpha_{P}\) become the new regression coefficients of the updated model. There is also a less extreme version of model refitting, where one only re-estimates certain parameters. This will be implemented into the package in the near future.
While we describe the updating methods above for logistic regression models, predRupdate also implements these approaches for time-to-event (survival) CPMs. The principles above are similar, expect the underlying model is changed to a Cox proportional hazards model (further parametric/ flexible parametric methods to be added in the future). Specifically, “intercept update” for a survival (time-to-event) CPM, has the following form: \[ h(t|X) = h_0(t)\exp(LP_i) = h_0(t)\exp(\hat{\beta}_1X_{i,1} + ... + \hat{\beta}_PX_{i,P}) \]
which has the effect of only re-estimating the baseline hazard in the new data. For re-calibration, the updated survival (time-to-event) CPM would be: \[ h(t|X) = h_0(t)\exp(\alpha_1LP_i) = h_0(t)\exp(\alpha_1(\hat{\beta}_1X_{i,1} + ... + \hat{\beta}_PX_{i,P})) \] This has the effect of multiplying each original model coefficient (log-hazard ratios) by \(\alpha_1\), and estimating a new baseline hazard in the new data.
In some situations, there are instances where multiple existing CPMs
are available for the same prediction task (e.g., existing models
developed across different countries). Here, model aggregation methods
can be used to pool these existing CPMs into a single model in the new
data. Various methods exist for this (Martin et al. 2017), with
predRupdate currently implementing stacked regression
(Debray et al. 2014) through pred_stacked_regression()
.
More methods will be added in the future.
Specifically, suppose there are a collection of \(M\) existing logistic regression CPMs, which all aim to predict the same binary outcome, \(Y\), but where developed in different populations \(j=1,...,M\). The models may contain different predictor variables. Each model has a linear predictor (LP) given by
\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \hat{\beta}_{0,j} + \hat{\beta}_{1,j}X_{i,1} + \hat{\beta}_{2,j}X_{i,2} + ... + \hat{\beta}_{P,j}X_{i,P} = LP_{i,j} \]
In this notation, if a given variable is not included in a given CPM,
then the corresponding \(\beta\) is
equal to zero. One can then apply these models to each observation \(i\) in the validation set to calculate the
set of \(M\) linear predictors.
pred_stacked_regression()
then applies stacked regression
by regressing these set of linear predictors against the observed
outcome in the validation dataset as
\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = \gamma_0 + \gamma_1LP_{i,1} + ... + \gamma_MLP_{i,M} \]
This equation can be re-arranged to express in terms of the set of
new (aggregated) predictor coefficients for each of the \(P\) predictor variables. Given that the
\(M\) existing models all aim to
predict the same outcome, there could be a high level of co-linearity in
the set of \(M\) linear predictors.
Therefore, there have been suggestions to fit the above stacked
regression model under the constraint that all of \(\gamma_1,...\gamma_M\) should be
non-negative. pred_stacked_regression()
implements stacked
regression with and without this constraint.
pred_stacked_regression()
can also implement stacked
regression for \(M\) existing
time-to-event (survival) CPMs. The technical details are similar to
those described above for logistic models, with the exception that the
underlying model is changed to a Cox proportional hazards model.