The brea package offers a number of advanced discrete event time modeling features, one of which is generalized additive model (GAM) style incorporation of arbitrary smooth nonlinear covariate effects. For example, with either discrete or continuous time Cox proportional hazards models, the effect of time \(t\) is classically modeled via an arbitrary smooth function called the baseline hazard, and this function is incorporated additively on the linear predictor scale. We may, however, also wish to model the nonlinear effects of other quantitative covariates, such as a patient's age in a biomedical study. This tutorial will illustrate how to include such nonlinear effects in discrete time-to-event models using the brea package.
Because including such nonlinear functions can result in much slower performance of the MCMC algorithms used to obtain inferences, we also illustrate the use of an optional Metropolis-Hastings algorithm that can dramatically improve sampling efficiency.
Throughout this tutorial, we assume the reader is familiar with the basics of discrete time survival analysis, elementary Bayesian analysis (including inference via Markov chain Monte Carlo), and basic use of the brea package. All of these topics are covered in the Introduction to brea vignette, which we strongly suggest the reader work through first.
Here we will briefly review the discrete time version of the Cox proportional hazards model introduced in (Cox 1972). Then we will show how to extend the model by including nonlinear effects of the form \(f_m(X_m)\) inside the linear predictor and in turn explain our Bayesian formulation for the functions \(f_m\). Finally, we will briefly discuss inference algorithms for the parameters representing the functions \(f_m\).
Let \(T\) denote the discrete time of event occurrence; by convention we assume the possible timepoints \(t\) of occurrence are the positive integers (\(t=1,2,3,\ldots\)). The discrete time Cox proportional hazards model relates the discrete time hazard rate \(\lambda(t)=P(T=t|T\geq t)\) to a linear predictor \(\eta(t)\) incorporating covariate effects using the logit link function: \[ \text{log}\left(\frac{\lambda(t)}{1-\lambda(t)}\right) = \eta(t) = f_0(t)+\beta_1X_1 + \cdots + \beta_MX_M \] The function \(f_0(t)\) is the baseline hazard that models the effect of discrete time \(t\) on the linear predictor scale; classically we do not presume any specific functional form for \(f_0(t)\) and instead assume only that it is an arbitrary smooth function. In contrast, the other covariate effects \(\beta_m X_m\) are modeled linearly as in standard multiple linear regression.
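As an aside, the discrete time Cox model above is formally a logistic regression applied to data in person-time format (one row per subject per time point at risk), so a version with one step height per time point can in principle be fit with base R's glm function. A minimal sketch, where pt, y, t, and x1 are hypothetical names for such an expanded data set:

# sketch: a discrete time Cox PH model fit as logistic regression on
# person-time data; pt (data frame), y (binary event indicator), t (discrete
# time), and x1 (a covariate) are hypothetical names for illustration
fit_glm <- glm(y ~ factor(t) + x1, family = binomial, data = pt)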
We would like to extend the above Cox model by allowing arbitrary nonlinear effects for quantitative covariates other than just time \(t\). We will also explicitly represent the potentially time-varying nature of any of our covariate values by writing the \(m^\text{th}\) covariate as \(X_m(t)\). With this change, there is no longer any reason to single out time \(t\) as a distinct covariate needing separate notation from the other covariates, since we can simply let the first covariate be time by setting \(X_1(t)=t\). Hence, we can write our model as: \[
\text{log}\left(\frac{\lambda(t)}{1-\lambda(t)}\right) = \eta(t) =
f_1(X_1(t))+f_2(X_2(t))+\cdots +f_M(X_M(t))
\] For categorical covariates, we may still use a representation \(f_m(X_m(t))\) for the corresponding effect by declaring that the function \(f_m\) assumes a distinct parameter value for each possible category of \(X_m\). For example, if the possible covariate categories are coded using positive integers \(k=1,\ldots,K\) (as the brea package assumes), then we may let \(f_m(k)=\beta_k\), so that \(f_m(X_m(t))\) becomes simply \(\beta_{X_m(t)}\).
There are many possible choices for how to model the smooth functions \(f_m\) when \(X_m\) is a quantitative variable. For example, we could use a parametric function such as a polynomial, or a crude step function; both of these possibilities are illustrated for the baseline hazard \(f_0(t)\) in the Introduction to brea vignette. However, it is often not possible to know in advance the appropriate functional form relating a quantitative covariate like time \(t\) to the hazard of event occurrence. In addition, the functional form may not be accurately captured by a simple polynomial or by a step function with a small number of steps.
Thus, we propose modeling the functions \(f_m\) using a highly flexible formulation: a step function with a large number of steps (usually 10–100) along with a prior distribution on the step heights that ensures the resulting functional form is not too irregular (i.e., the function is approximately smooth). Specifically, suppose we have a quantitative covariate \(X\), and let \(c_0,c_1,\ldots,c_K\) be a sequence of step boundaries such that all values of \(X\) satisfy \(c_0<X\leq c_K\). We assume the function \(f\) modeling the effect of \(X\) is constant on each step interval \((c_{k-1},c_k]\), and we denote the corresponding step height as \(\beta_k\) (for \(k=1,\ldots,K\)).
Since the total number of steps \(K\) we will use to model a function \(f\) is large, there may be step intervals \((c_{k-1},c_k]\) with few observations (in particular, few observed events) with \(X\) values in that interval. Thus, we will place a prior probability distribution on the vector of step heights \((\beta_1,\beta_2,\ldots,\beta_K)\) that ensures that the step heights are not too erratic for ranges of \(X\) values in which we have small amounts of data per step. In other words, we will force some amount of smoothness in our step function even when the amount of data per step is insufficient to precisely estimate each step height in isolation.
The class of prior distributions we will use is called a Gaussian Markov random field (GMRF). The Gaussian name refers to the fact that these are (improper) multivariate normal distributions, and the Markov name refers to the fact that these distributions behave like Markov random walks, in which each step height is presumed to be close to the heights of its neighboring steps. A technical description of these distributions is beyond the scope of this tutorial; we refer the interested reader to (Rue and Held 2005) for a general in-depth treatment or (King and Weiss 2021) for a detailed treatment in the context of our discrete Cox model.
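To build intuition for how such a prior encourages smoothness, the following sketch (plain base R, not brea code) simulates step heights from a first-order Gaussian random walk, the simplest GMRF, under two different increment standard deviations; smaller increments produce visibly smoother step height sequences:

# a minimal sketch: draw step heights from a first-order Gaussian random walk
# under two smoothness levels (the increment sd plays the role of the GMRF
# smoothing parameter)
set.seed(1)
K <- 20
beta_smooth <- cumsum(rnorm(K, 0, 0.1)) # small increment sd: smoother steps
beta_rough <- cumsum(rnorm(K, 0, 0.5)) # large increment sd: rougher steps
plot(1:K, beta_rough, type = "s", col = "red",
     ylim = range(c(beta_smooth, beta_rough)),
     xlab = "Step Index k", ylab = "Step Height",
     main = "First-Order Random Walk Prior Draws")
lines(1:K, beta_smooth, type = "s", col = "blue")
legend("topleft", legend = c("Increment SD 0.5", "Increment SD 0.1"),
       col = c("red", "blue"), lty = 1)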
The brea package uses Markov chain Monte Carlo (MCMC) algorithms to draw samples from the posterior distribution of the model parameters. As discussed in the Introduction to brea vignette, these posterior samples are not independent samples but are instead correlated samples, in which each draw from the posterior tends to be similar to the one that preceded it. As a result, MCMC samples provide less information about the posterior distribution than independent samples of the same size, and how much less depends on the amount of serial correlation, or autocorrelation, in the MCMC sample.
One MCMC inference algorithm available in brea is a random walk Metropolis algorithm that updates one parameter at a time. When model parameters have high posterior correlation (which is often the case when GMRF priors induce posterior correlation among neighboring step height parameters), random walk Metropolis algorithms with single-parameter updates often result in high autocorrelation of the MCMC sampled values. This means the MCMC efficiency is poor, and the length of the MCMC run \(S\) (i.e., the size of the posterior sample) must be increased substantially in order to obtain enough information about the posterior distribution to achieve accurate inferences. Values such as \(S=100{,}000\) or \(S=1{,}000{,}000\) can take minutes or even hours of computation time for models used on large data sets, which can be especially impractical if the user is trying many models.
Hence, the brea package also offers an optional block Metropolis-Hastings algorithm that updates many parameters at once, using updates that explore the posterior distribution efficiently even when there is high posterior correlation among the model parameters. The extent of the efficiency improvement over the default random walk Metropolis algorithm is highly dependent on the particular data set and model, but typically ranges between \(1x\) (i.e., no improvement) and \(100x\) on a per-iteration basis. However, the block Metropolis-Hastings updates typically require 2–3 times as much computation per MCMC iteration, so it is possible that overall (in terms of wall clock time) the random walk Metropolis algorithm may perform better. We illustrate this varying level of performance improvement at the end of the following section.
We will illustrate use of the gmrf prior type in the brea package to model a smooth baseline hazard function in a simple Cox model, and we will use the same data set as in the Introduction to brea vignette, namely the leukemia remission data from (Cox 1972). These data come from a clinical trial of acute leukemia patients with two treatment groups: the drug 6-mercaptopurine (6-MP) and a placebo control. There were \(n=42\) patients in total, with 21 assigned to each group. The outcome of interest was the duration of leukemia remission (i.e., the event of interest was the end of cancer remission, so a lower hazard of event occurrence is better). These durations were measured in whole weeks, so time here is discrete.
In the remainder of this section, we will format the data for use with the brea package, give posterior inferences for the model parameters (specifically the parameters representing the smooth baseline hazard), and finally illustrate the performance improvement from using the block Metropolis-Hastings algorithm.
The data as given in (Cox 1972) are as follows:

| Treatment Group | Observed Duration of Remission (weeks) (* denotes right-censored observations) |
|---|---|
| Group 1 (6-MP) | 6*, 6, 6, 6, 7, 9*, 10*, 10, 11*, 13, 16, 17*, 19*, 20*, 22, 23, 25*, 32*, 32*, 34*, 35* |
| Group 2 (control) | 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23 |
We begin by creating a study time variable time (time of event occurrence or right censoring), an event/censoring indicator variable event (1 for event occurrence, 0 for right censoring), and a treatment group variable treat (1 for treatment, 2 for control):
time <- c(6,6,6,6,7,9,10,10,11,13,16,17,19,20,22,23,25,32,32,34,35,
1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
event <- c(0,1,1,1,1,0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
treat <- c(rep(1,21),rep(2,21))
Next we expand the data into person-time format, with one observation for each discrete time point at which each patient was at risk for event occurrence. We first set the total number of person-time observations N and create matrices of height N to store the predictor values x (with first column subject id, second column time, and third column treatment group) and the binary response variable y:
N <- sum(time) # total number of person-time observations
x <- matrix(0,N,3) # columns id, time, and treatment group
colnames(x) <- c("id","t","treat")
y <- matrix(0,N,1) # only one column since only one competing risk
We then iterate through the 42 observations in the original (unexpanded) data set, expanding each one and adding the corresponding rows to x and y:
next_row <- 1 # next available row in the person-time matrices
for (i in seq_along(time)) {
rows <- next_row:(next_row + time[i] - 1) # person-time rows to fill
x[rows,1] <- rep(i,time[i]) # subject id number is constant for each person
x[rows,2] <- seq_len(time[i]) # discrete time is integers 1,2,...,time[i]
x[rows,3] <- rep(treat[i],time[i]) # treatment group is constant
y[rows,1] <- c(rep(0,time[i] - 1),event[i]) # outcome is 0's until study time
next_row <- next_row + time[i] # increment the next available row pointer
}
Finally, we will create a copy of the predictor matrix x specific to the model we want to fit by cutting the time variable \(t\) into intervals \((c_{k-1},c_k]\) as described above and retaining the treatment variable (the subject id variable is not needed in model fitting if we have no random effects):
x_brea <- matrix(0,N,2)
c_k <- seq(0,36,3) # step boundaries
x_brea[,1] <- cut(x[,2],c_k,labels=FALSE) # grouped time t
x_brea[,2] <- x[,3] # treatment group
Our step boundaries \(c_k\) are multiples of 3 weeks, so our step widths are all three weeks, giving us 12 steps in total. We will illustrate other choices for the step configuration later in this section.
Before fitting the model, we also need to create a specification of the type of predictor and associated prior parameters for each of our two covariates. In brea, a prior specification is a list with one item for each predictor. Each item is in turn also a list, in which the first list element specifies the type of predictor: here, either "gmrf" for quantitative covariates using the smoothed step function approach described above, or "cat" for categorical covariates.
The remaining list elements set prior hyperparameters. Detailed discussion of hyperparameter specification is beyond the scope of this tutorial, and we again refer the reader to (King and Weiss 2021) for a technical treatment. Here, we will just use noninformative versions of the priors, which may generally be used as defaults for their respective covariate types. These may be specified for our two covariates as follows:
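# one specification list per predictor column of x_brea; the hyperparameter
# values below are assumed noninformative settings (see ?brea_mcmc for the
# exact meaning of each prior type's hyperparameters)
priors <- list(list("gmrf",3,0.01), # grouped time t: smoothed step function
               list("cat",4)) # treatment group: categorical effect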
To obtain posterior inferences, we use the brea_mcmc function, which has the following signature:
brea_mcmc(x, y, priors = NULL, S = 1000, B = 100, n = NULL, K = NULL,
store_re = FALSE, block_mh = TRUE)
We have already created the x, y, and priors arguments, and the only other arguments that concern us now are the number of post-burn-in MCMC iterations S as well as the number B of initial iterations discarded as burn-in. We will set these to 10 times their default values to ensure our inferences in the next section are sufficiently accurate; we discuss accuracy assessment in more detail in the final section. Our MCMC run may be executed as follows:
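set.seed(1234) # fix the RNG seed so results are exactly reproducible
fit <- brea_mcmc(x_brea,y,priors,10000,1000) # S and B at 10x their defaults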
Because MCMC algorithms use pseudo-random number generation, results will generally differ on repeated runs even with identical function calls. Hence, we have set the seed of the random number generator to ensure that our results are exactly reproducible.
The structure of the R object returned by brea_mcmc is shown below:
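str(fit) # compact summary of the returned list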
## List of 4
## $ b_0_s: num [1, 1:10000] -2.52 -2.93 -2.81 -2.68 -3.02 ...
## $ b_m_s:List of 2
## ..$ : num [1, 1:12, 1:10000] 0.08563 0.10531 0.05935 0.00911 -0.04625 ...
## ..$ : num [1, 1:2, 1:10000] -0.536 0.536 -0.783 0.783 -0.783 ...
## $ s_m_s:List of 2
## ..$ : num [1, 1:10000] 0.0787 0.062 0.0865 0.0846 0.0992 ...
## ..$ : NULL
## $ b_m_a:List of 2
## ..$ : int 8907
## ..$ : int [1:2] 4490 4745
The list component that interests us now is b_m_s, which contains the posterior sampled values of the parameters representing the covariate effects. Specifically, b_m_s[[m]][1,k,s] is the \(s^\text{th}\) sampled value (post burn-in) of the effect of the \(k^\text{th}\) possible covariate value of the \(m^\text{th}\) predictor on the hazard of event occurrence (the 1 index in [1,k,s] denotes that we are examining the first competing risk, which must be specified even in cases like this where there is only one competing risk). In the Introduction to brea vignette, we examined the treatment variable (m=2) effect in detail; here, we will look at the effect of the discrete time \(t\) variable (m=1).
Posterior medians of the time effect parameters on the linear predictor scale may be calculated as follows:
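# posterior medians of the K = 12 time effect parameters on the linear
# predictor scale; the object name b_t is chosen here for reuse below
b_t <- apply(fit$b_m_s[[1]][1,,],1,median)
b_t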
## [1] -0.0944690686 -0.0599252086 -0.0335874917 -0.0054899807 0.0003766892
## [6] 0.0111216229 0.0206079799 0.0505348763 0.0352363812 0.0265420304
## [11] 0.0222435188 0.0195390234
These values are not directly interpretable, since they represent the logarithms of the factors by which the hazard (or more technically, the odds of the hazard) differs from the overall average hazard. Hence, we exponentiate these values to get them on the hazard ratio scale:
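hr <- exp(b_t) # hazard ratio scale; hr is reused in the plots below
hr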
## [1] 0.9098559 0.9418350 0.9669703 0.9945251 1.0003768 1.0111837 1.0208218
## [8] 1.0518335 1.0358645 1.0268974 1.0224928 1.0197312
Finally, we may visualize the corresponding estimated step heights in our step function (on the hazard ratio scale) as follows:
par(cex=0.66,mgp=c(1.75,0.5,0),mai=c(0.4,0.4,0.5,0.1))
# set up the plotting window:
plot(NA,NA,xlim=range(c_k),ylim=range(hr),log="y",
xlab="Time at Risk (weeks)",ylab="Relative Discrete Hazard",
main="Time Effect on Hazard of Remission Cessation")
# add each of the K=12 steps:
for (k in seq_len(length(c_k)-1)) {
lines(c(c_k[k],c_k[k+1]),rep(hr[k],2))
}
We can see that the hazard of remission cessation increases steadily until it peaks just after 20 weeks, leveling off thereafter.
For our initial model fitting above, we used a step function approximation of the functional relationship between discrete time \(t\) and the hazard of event occurrence with a total of \(K=12\) steps, all with an equal width of 3 weeks. Here, we will compare this formulation to coarser and finer ones (i.e., using fewer or more steps). In particular, we will now obtain inferences from models with step widths of 6 weeks (6 steps total), 2 weeks (18 steps), and 1 week (36 steps), and we will compare the results to the model with a step width of 3 weeks (12 steps) used earlier.
c_k6 <- seq(0,36,6) # 6-week apart step boundaries
x_brea[,1] <- cut(x[,2],c_k6,labels=FALSE) # grouped time t
set.seed(1234)
fit6 <- brea_mcmc(x_brea,y,priors,10000,1000)
c_k2 <- seq(0,36,2) # 2-week apart step boundaries
x_brea[,1] <- cut(x[,2],c_k2,labels=FALSE) # grouped time t
set.seed(1234)
fit2 <- brea_mcmc(x_brea,y,priors,10000,1000)
c_k1 <- seq(0,36,1) # 1-week apart step boundaries
x_brea[,1] <- cut(x[,2],c_k1,labels=FALSE) # grouped time t
set.seed(1234)
fit1 <- brea_mcmc(x_brea,y,priors,10000,1000)
Note that for the last formulation (36 steps, each of length just 1 week), we could have simply used the original time variable x[,2], since this discrete time variable already consisted of positive integer values (and hence no grouping/discretization occurred).
As before, we will calculate posterior medians of the time effect in each case on the hazard ratio scale:
hr6 <- exp(apply(fit6$b_m_s[[1]][1,,],1,median))
hr2 <- exp(apply(fit2$b_m_s[[1]][1,,],1,median))
hr1 <- exp(apply(fit1$b_m_s[[1]][1,,],1,median))
We may then plot the step functions for all four choices of step width on the same graph:
par(cex=0.66,mgp=c(1.75,0.5,0),mai=c(0.4,0.4,0.5,0.1))
# set up the plotting window:
plot(NA,NA,xlim=range(c_k),ylim=range(c(hr,hr6,hr2,hr1)),log="y",
xlab="Time at Risk (weeks)",ylab="Relative Discrete Hazard",
main="Time Effect on Hazard of Remission Cessation")
for (k in seq_len(length(c_k6)-1)) {
lines(c(c_k6[k],c_k6[k+1]),rep(hr6[k],2),col="red")
}
for (k in seq_len(length(c_k)-1)) {
lines(c(c_k[k],c_k[k+1]),rep(hr[k],2))
}
for (k in seq_len(length(c_k2)-1)) {
lines(c(c_k2[k],c_k2[k+1]),rep(hr2[k],2),col="green")
}
for (k in seq_len(length(c_k1)-1)) {
lines(c(c_k1[k],c_k1[k+1]),rep(hr1[k],2),col="blue")
}
legend("topleft",
legend=c("6-week Steps","3-week Steps","2-week Steps","1-week Steps"),
col=c("red","black","green","blue"),lty=1)
We can see immediately that the coarsest formulation (6-week steps, in red) does not capture the apparent magnitude of the time effect variation across the 36 study weeks. The 3-week (black) and 2-week (green) step formulations do a better job of capturing the time effect variation, but still do not have adequate granularity to capture either the peak in the hazard rate around 22–23 weeks or the greatly diminished hazard of remission cessation at the very beginning of the study.
We will now compare the efficiency of the block Metropolis-Hastings algorithm (selecting the block_mh = TRUE option of the brea_mcmc function) to the random walk Metropolis algorithm (block_mh = FALSE). In particular, we will examine the efficiency for inferences about the time effect parameters, since we expect efficiency for such parameters to be especially problematic given the high posterior correlation induced by the use of the GMRF prior.
For our comparison, we will use the original 3-week step length formulation; results for the other formulations are either similar or show even greater efficiency improvement for the Metropolis-Hastings version. We have already obtained inferences from the 3-week step model using Metropolis-Hastings, since block_mh = TRUE is the default option for the brea_mcmc function. We now obtain inferences using random walk Metropolis for comparison purposes:
c_k <- seq(0,36,3) # step boundaries
x_brea[,1] <- cut(x[,2],c_k,labels=FALSE) # grouped time t
set.seed(1234)
fit_rwm <- brea_mcmc(x_brea,y,priors,10000,1000,block_mh = FALSE)
Next, we need to calculate the MCMC efficiency of our posterior samples for each of the \(K=12\) parameters representing the time effect. Efficiency is defined as the percentage of the information contained in an independent sample of the same size that is contained in our MCMC sample. For example, if we have an MCMC sample of size \(S=1{,}000\), and this sample contains one tenth of the amount of information contained in an independent sample of size 1,000, then the MCMC efficiency is 10%. Similarly, we define the effective sample size of an MCMC sample to be the size of an independent sample that contains the same amount of information as our MCMC sample. So, for example, if an MCMC sample of size 1,000 contains the same amount of information as an independent sample of size 100, then the effective sample size is 100. These two metrics are related: the MCMC efficiency equals the effective sample size divided by the size \(S\) of the MCMC sample.
To calculate the MCMC efficiency for our vector of 12 time effect parameters, we will use the effectiveSize function from the coda package to calculate effective sample sizes:
# make the effectiveSize function from the coda package available:
library(coda)
# MCMC sample size:
S <- 10000
# MCMC efficiency for the default Metropolis-Hastings algorithm:
eff_mh <- apply(fit$b_m_s[[1]][1,,],1,effectiveSize)/S
# MCMC efficiency for the random walk Metropolis algorithm:
eff_rwm <- apply(fit_rwm$b_m_s[[1]][1,,],1,effectiveSize)/S
We may then examine the computed efficiencies as well as the factor by which efficiency improved when moving from random walk Metropolis to Metropolis-Hastings:
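eff_rwm # per-parameter efficiencies, random walk Metropolis
eff_mh # per-parameter efficiencies, block Metropolis-Hastings
summary(eff_rwm)
summary(eff_mh)
eff_mh/eff_rwm # factor of efficiency improvement
summary(eff_mh/eff_rwm)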
## [1] 0.01692673 0.01494314 0.01708034 0.02282697 0.03759462 0.05415988
## [7] 0.06363047 0.03311775 0.02149289 0.01689545 0.01441453 0.01943719
## [1] 0.3277317 0.4817151 0.5129773 0.4883707 0.6315530 0.5639637 0.6776411
## [8] 0.3099692 0.5367384 0.5119067 0.3846150 0.4203444
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01441 0.01692 0.02047 0.02771 0.03424 0.06363
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3100 0.4114 0.5001 0.4873 0.5435 0.6776
## [1] 19.361788 32.236545 30.033201 21.394456 16.799026 10.412942 10.649632
## [8] 9.359609 24.972835 30.298498 26.682456 21.625774
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.36 15.26 21.51 21.15 27.52 32.24
We can see that the typical efficiency (across the \(K=12\) parameters jointly representing the effect of time \(t\)) of the random walk Metropolis algorithm is around 2%, whereas the typical efficiency of the block Metropolis-Hastings algorithm is around 50%. In addition, the factor of efficiency improvement is typically around \(20x\) and ranges between \(10x\) and \(30x\) across the 12 parameters. We may also visualize these improvement levels:
par(cex=0.66,mgp=c(1.75,0.5,0),mai=c(0.4,0.4,0.5,0.1))
K <- 12
plot(NA,NA,xlim=c(1,K),ylim=c(0,0.7),xlab="Parameter Number",
ylab="MCMC Efficiency",xaxt="n")
axis(1,1:K,1:K)
abline(h=0)
lines(1:K,eff_rwm,col="red",type="b",pch=20)
lines(1:K,eff_mh,col="blue",type="b",pch=20)
legend("topleft",
legend=c("Block Metropolis-Hastings","Random Walk Metropolis"),
col=c("blue","red"),lty=1,pch=20)
The efficiency of each of the respective algorithms depends on the amount of data available to estimate each individual parameter as well as how much the data indicate the hazard is changing between steps. More specifically, random walk Metropolis performs worst when there is little data for each step, since this leads to higher posterior correlation between model parameters. In contrast, the block Metropolis-Hastings algorithm’s performance is more consistent, but tends to be worst when the data for a certain step height “conflicts” with the prior (e.g., the data is pulling the \(8^\text{th}\) step height in our example higher, while the prior is pulling this step’s height lower from both sides). This is likely due to inaccuracy in the approximations used to propose parameter updates for the block Metropolis-Hastings algorithm in such cases.
Finally, we note that updating the parameters for a particular covariate using the block Metropolis-Hastings algorithm takes about three times as much computation time per sampled value as when using random walk Metropolis. This means the computation time per MCMC iteration can be up to three times longer (usually this time penalty is smaller, though, since only covariates with the gmrf prior type use the block Metropolis-Hastings algorithm). Still, with per-iteration MCMC efficiency increasing by a factor of 20 or more, overall real-time efficiency is still often improved by a factor of 10 or more.
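Whether the per-iteration efficiency gain outweighs this extra per-iteration cost in a particular application is easy to check directly; a minimal sketch using base R's system.time (a short run suffices for timing, and results will vary by machine, model, and data set):

# rough wall clock comparison of the two algorithms on a short run
system.time(brea_mcmc(x_brea, y, priors, 1000, 100, block_mh = TRUE))
system.time(brea_mcmc(x_brea, y, priors, 1000, 100, block_mh = FALSE))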