The brea package offers a number of advanced discrete event time modeling features, one of which is generalized additive model (GAM) style incorporation of arbitrary smooth nonlinear covariate effects. For example, with either discrete or continuous time Cox proportional hazards models, the effect of time \(t\) is classically modeled via an arbitrary smooth function called the baseline hazard, and this function is incorporated additively on the linear predictor scale. We may, however, also wish to model the nonlinear effects of other quantitative covariates, such as a patient's age in a biomedical study. This tutorial will illustrate how to include such nonlinear effects in discrete time-to-event models using the brea package.
Because including such nonlinear functions can result in much slower performance of the MCMC algorithms used to obtain inferences, we also illustrate the use of an optional Metropolis-Hastings algorithm that can dramatically improve sampling efficiency.
Throughout this tutorial, we assume the reader is familiar with the basics of discrete time survival analysis, elementary Bayesian analysis (including inference via Markov chain Monte Carlo), and basic use of the brea package. All of these topics are covered in the Introduction to brea vignette, which we strongly suggest the reader work through first.
Here we will briefly review the discrete time version of the Cox proportional hazards model introduced in (Cox 1972). Then we will show how to extend the model by including nonlinear effects of the form \(f_m(X_m)\) inside the linear predictor and in turn explain our Bayesian formulation for the functions \(f_m\). Finally, we will briefly discuss inference algorithms for the parameters representing the functions \(f_m\).
Let \(T\) denote the discrete time of event occurrence; by convention we assume the possible timepoints \(t\) of occurrence are the positive integers (\(t=1,2,3,\ldots\)). The discrete time Cox proportional hazards model relates the discrete time hazard rate \(\lambda(t)=P(T=t|T\geq t)\) to a linear predictor \(\eta(t)\) incorporating covariate effects using the logit link function: \[ \text{log}\left(\frac{\lambda(t)}{1-\lambda(t)}\right) = \eta(t) = f_0(t)+\beta_1X_1 + \cdots + \beta_MX_M \] The function \(f_0(t)\) is the baseline hazard that models the effect of discrete time \(t\) on the linear predictor scale; classically we do not presume any specific functional form for \(f_0(t)\) and instead assume only that it is an arbitrary smooth function. In contrast, the other covariate effects \(\beta_m X_m\) are modeled linearly as in standard multiple linear regression.
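As an aside, the discrete time Cox model above is formally a logistic regression applied to data in person-time format (one row per subject per time point at risk), so a version with one step height per time point can in principle be fit with base R's glm function. A minimal sketch, where pt, y, t, and x1 are hypothetical names for such an expanded data set:

# sketch: a discrete time Cox PH model fit as logistic regression on
# person-time data; pt (data frame), y (binary event indicator), t (discrete
# time), and x1 (a covariate) are hypothetical names for illustration
fit_glm <- glm(y ~ factor(t) + x1, family = binomial, data = pt)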
We would like to extend the above Cox model by allowing arbitrary nonlinear effects for quantitative covariates other than just time \(t\). We will also explicitly represent the potentially time-varying nature of any of our covariate values by writing the \(m^\text{th}\) covariate as \(X_m(t)\). With this change, there is no longer any reason to single out time \(t\) as a distinct covariate needing separate notation from the other covariates, since we can simply let the first covariate be time by setting \(X_1(t)=t\). Hence, we can write our model as: \[
\text{log}\left(\frac{\lambda(t)}{1-\lambda(t)}\right) = \eta(t) =
f_1(X_1(t))+f_2(X_2(t))+\cdots +f_M(X_M(t))
\] For categorical covariates, we may still use a representation \(f_m(X_m(t))\) for the corresponding effect by declaring that the function \(f_m\) assumes a distinct parameter value for each possible category of \(X_m\). For example, if the possible covariate categories are coded using positive integers \(k=1,\ldots,K\) (as the brea package assumes), then we may let \(f_m(k)=\beta_k\), so that \(f_m(X_m(t))\) becomes simply \(\beta_{X_m(t)}\).
There are many possible choices for how to model the smooth functions \(f_m\) when \(X_m\) is a quantitative variable. For example, we could use a parametric function such as a polynomial, or a crude step function; both of these possibilities are illustrated for the baseline hazard \(f_0(t)\) in the Introduction to brea vignette. However, it is often not possible to know in advance the appropriate functional form relating a quantitative covariate like time \(t\) to the hazard of event occurrence. In addition, the functional form may not be accurately captured by a simple polynomial or by a step function with a small number of steps.
Thus, we propose modeling the functions \(f_m\) using a highly flexible formulation: a step function with a large number of steps (usually 10–100) along with a prior distribution on the step heights that ensures the resulting functional form is not too irregular (i.e., the function is approximately smooth). Specifically, suppose we have a quantitative covariate \(X\), and let \(c_0,c_1,\ldots,c_K\) be a sequence of step boundaries such that all values of \(X\) satisfy \(c_0<X\leq c_K\). We assume the function \(f\) modeling the effect of \(X\) is constant on each step interval \((c_{k-1},c_k]\), and we denote the corresponding step height as \(\beta_k\) (for \(k=1,\ldots,K\)).
Since the total number of steps \(K\) we will use to model a function \(f\) is large, there may be step intervals \((c_{k-1},c_k]\) with few observations (in particular, few observed events) with \(X\) values in that interval. Thus, we will place a prior probability distribution on the vector of step heights \((\beta_1,\beta_2,\ldots,\beta_K)\) that ensures that the step heights are not too erratic for ranges of \(X\) values in which we have small amounts of data per step. In other words, we will force some amount of smoothness in our step function even when the amount of data per step is insufficient to precisely estimate each step height in isolation.
The class of prior distributions we will use is called a Gaussian Markov random field (GMRF). The Gaussian name refers to the fact that these are (improper) multivariate normal distributions, and the Markov name refers to the fact that these distributions behave like Markov random walks, in which each step height is presumed to be close to the heights of its neighboring steps. A technical description of these distributions is beyond the scope of this tutorial; we refer the interested reader to (Rue and Held 2005) for a general in-depth treatment or (King and Weiss 2021) for a detailed treatment in the context of our discrete Cox model.
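To build intuition for how such a prior encourages smoothness, the following sketch (plain base R, not brea code) simulates step heights from a first-order Gaussian random walk, the simplest GMRF, under two different increment standard deviations; smaller increments produce visibly smoother step height sequences:

# a minimal sketch: draw step heights from a first-order Gaussian random walk
# under two smoothness levels (the increment sd plays the role of the GMRF
# smoothing parameter)
set.seed(1)
K <- 20
beta_smooth <- cumsum(rnorm(K, 0, 0.1)) # small increment sd: smoother steps
beta_rough <- cumsum(rnorm(K, 0, 0.5)) # large increment sd: rougher steps
plot(1:K, beta_rough, type = "s", col = "red",
     ylim = range(c(beta_smooth, beta_rough)),
     xlab = "Step Index k", ylab = "Step Height",
     main = "First-Order Random Walk Prior Draws")
lines(1:K, beta_smooth, type = "s", col = "blue")
legend("topleft", legend = c("Increment SD 0.5", "Increment SD 0.1"),
       col = c("red", "blue"), lty = 1)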
The brea package uses Markov chain Monte Carlo (MCMC) algorithms to draw samples from the posterior distribution of the model parameters. As discussed in the Introduction to brea vignette, these posterior samples are not independent samples but are instead correlated samples, in which each draw from the posterior tends to be similar to the one that preceded it. As a result, MCMC samples provide less information about the posterior distribution than independent samples of the same size, and how much less depends on the amount of serial correlation, or autocorrelation, in the MCMC sample.
One MCMC inference algorithm available in brea is a random walk Metropolis algorithm that updates one parameter at a time. When model parameters have high posterior correlation (which is often the case when GMRF priors induce posterior correlation among neighboring step height parameters), random walk Metropolis algorithms with single-parameter updates often result in high autocorrelation of the MCMC sampled values. This means the MCMC efficiency is poor, and the length of the MCMC run \(S\) (i.e., the size of the posterior sample) must be increased substantially in order to obtain enough information about the posterior distribution to achieve accurate inferences. Values such as \(S=100{,}000\) or \(S=1{,}000{,}000\) can take minutes or even hours of computation time for models used on large data sets, which can be especially impractical if the user is trying many models.
Hence, the brea package also offers an optional block Metropolis-Hastings algorithm that updates many parameters at once, using updates that explore the posterior distribution efficiently even when there is high posterior correlation among the model parameters. The extent of the efficiency improvement over the default random walk Metropolis algorithm is highly dependent on the particular data set and model, but typically ranges between \(1x\) (i.e., no improvement) and \(100x\) on a per-iteration basis. However, the block Metropolis-Hastings updates typically require 2–3 times as much computation per MCMC iteration, so it is possible that overall (in terms of wall clock time) the random walk Metropolis algorithm may perform better. We illustrate this varying level of performance improvement at the end of the following section.
We will illustrate use of the gmrf prior type in the brea package to model a smooth baseline hazard function in a simple Cox model, and we will use the same data set as in the Introduction to brea vignette, namely the leukemia remission data from (Cox 1972). These data come from a clinical trial of acute leukemia patients with two treatment groups: the drug 6-mercaptopurine (6-MP) and a placebo control. There were \(n=42\) patients in total, with 21 assigned to each group. The outcome of interest was the duration of leukemia remission (i.e., the event of interest was the end of cancer remission, so a lower hazard of event occurrence is better). These durations were measured in whole weeks, so time here is discrete.
In the remainder of this section, we will format the data for use with the brea package, give posterior inferences for the model parameters (specifically the parameters representing the smooth baseline hazard), and finally illustrate the performance improvement from using the block Metropolis-Hastings algorithm.
The data as given in (Cox 1972) are as follows:

| Treatment Group | Observed Duration of Remission (weeks) (* denotes right-censored observations) |
|---|---|
| Group 1 (6-MP) | 6*, 6, 6, 6, 7, 9*, 10*, 10, 11*, 13, 16, 17*, 19*, 20*, 22, 23, 25*, 32*, 32*, 34*, 35* |
| Group 2 (control) | 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23 |
We begin by creating a study time variable time (time of event occurrence or right censoring), an event/censoring indicator variable event (1 for event occurrence, 0 for right censoring), and a treatment group variable treat (1 for treatment, 2 for control):
time <- c(6,6,6,6,7,9,10,10,11,13,16,17,19,20,22,23,25,32,32,34,35,
1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23)
event <- c(0,1,1,1,1,0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
treat <- c(rep(1,21),rep(2,21))
Next we expand the data into person-time format, with one observation for each discrete time point at which each patient was at risk for event occurrence. We first set the total number of person-time observations N and create matrices of height N to store the predictor values x (with first column subject id, second column time, and third column treatment group) and the binary response variable y:
N <- sum(time) # total number of person-time observations
x <- matrix(0,N,3) # columns id, time, and treatment group
colnames(x) <- c("id","t","treat")
y <- matrix(0,N,1) # only one column since only one competing risk
We then iterate through the 42 observations in the original (unexpanded) data set, expanding each one and adding the corresponding rows to x and y:
next_row <- 1 # next available row in the person-time matrices
for (i in seq_along(time)) {
rows <- next_row:(next_row + time[i] - 1) # person-time rows to fill
x[rows,1] <- rep(i,time[i]) # subject id number is constant for each person
x[rows,2] <- seq_len(time[i]) # discrete time is integers 1,2,...,time[i]
x[rows,3] <- rep(treat[i],time[i]) # treatment group is constant
y[rows,1] <- c(rep(0,time[i] - 1),event[i]) # outcome is 0's until study time
next_row <- next_row + time[i] # increment the next available row pointer
}
Finally, we will create a copy of the predictor matrix x specific to the model we want to fit by cutting the time variable \(t\) into intervals \((c_{k-1},c_k]\) as described above and retaining the treatment variable (the subject id variable is not needed in model fitting if we have no random effects):
x_brea <- matrix(0,N,2)
c_k <- seq(0,36,3) # step boundaries
x_brea[,1] <- cut(x[,2],c_k,labels=FALSE) # grouped time t
x_brea[,2] <- x[,3] # treatment group
Our step boundaries \(c_k\) are multiples of 3 weeks, so our step widths are all three weeks, giving us 12 steps in total. We will illustrate other choices for the step configuration later in this section.
Before fitting the model, we also need to create a specification of the type of predictor and associated prior parameters for each of our two covariates. In brea, a prior specification is a list with one item for each predictor. Each item is in turn also a list, in which the first list element specifies the type of predictor: here, either "gmrf" for quantitative covariates using the smoothed step function approach described above, or "cat" for categorical covariates.
The remaining list elements set prior hyperparameters. Detailed discussion of hyperparameter specification is beyond the scope of this tutorial, and we again refer the reader to (King and Weiss 2021) for a technical treatment. Here, we will just use noninformative versions of the priors, which may generally be used as defaults for their respective covariate types. These may be specified for our two covariates as follows:
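# one specification list per predictor column of x_brea; the hyperparameter
# values below are assumed noninformative settings (see ?brea_mcmc for the
# exact meaning of each prior type's hyperparameters)
priors <- list(list("gmrf",3,0.01), # grouped time t: smoothed step function
               list("cat",4)) # treatment group: categorical effect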
To obtain posterior inferences, we use the brea_mcmc function, which has the following signature:
brea_mcmc(x, y, priors = NULL, S = 1000, B = 100, n = NULL, K = NULL,
store_re = FALSE, block_mh = TRUE)
We have already created the x, y, and priors arguments, and the only other arguments that concern us now are the number of post-burn-in MCMC iterations S as well as the number B of initial iterations discarded as burn-in. We will set these to 10 times their default values to ensure our inferences in the next section are sufficiently accurate; we discuss accuracy assessment in more detail in the final section. Our MCMC run may be executed as follows:
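set.seed(1234) # fix the RNG seed so results are exactly reproducible
fit <- brea_mcmc(x_brea,y,priors,10000,1000) # S and B at 10x their defaults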
Because MCMC algorithms use pseudo-random number generation, results will generally differ on repeated runs even with identical function calls. Hence, we have set the seed of the random number generator to ensure that our results are exactly reproducible.
The structure of the R object returned by brea_mcmc is shown below:
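str(fit) # compact summary of the returned list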
## List of 4
## $ b_0_s: num [1, 1:10000] -2.52 -2.93 -2.81 -2.68 -3.02 ...
## $ b_m_s:List of 2
## ..$ : num [1, 1:12, 1:10000] 0.08563 0.10531 0.05935 0.00911 -0.04625 ...
## ..$ : num [1, 1:2, 1:10000] -0.536 0.536 -0.783 0.783 -0.783 ...
## $ s_m_s:List of 2
## ..$ : num [1, 1:10000] 0.0787 0.062 0.0865 0.0846 0.0992 ...
## ..$ : NULL
## $ b_m_a:List of 2
## ..$ : int 8907
## ..$ : int [1:2] 4490 4745
The list component that interests us now is b_m_s, which contains the posterior sampled values of the parameters representing the covariate effects. Specifically, b_m_s[[m]][1,k,s] is the \(s^\text{th}\) sampled value (post burn-in) of the effect of the \(k^\text{th}\) possible covariate value of the \(m^\text{th}\) predictor on the hazard of event occurrence (the 1 index in [1,k,s] denotes that we are examining the first competing risk, which must be specified even in cases like this where there is only one competing risk). In the Introduction to brea vignette, we examined the treatment variable (m=2) effect in detail; here, we will look at the effect of the discrete time \(t\) variable (m=1).
Posterior medians of the time effect parameters on the linear predictor scale may be calculated as follows:
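# posterior medians of the K = 12 time effect parameters on the linear
# predictor scale; the object name b_t is chosen here for reuse below
b_t <- apply(fit$b_m_s[[1]][1,,],1,median)
b_t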
## [1] -0.0944690686 -0.0599252086 -0.0335874917 -0.0054899807 0.0003766892
## [6] 0.0111216229 0.0206079799 0.0505348763 0.0352363812 0.0265420304
## [11] 0.0222435188 0.0195390234
These values are not directly interpretable, since they represent the logarithms of the factors by which the hazard (or more technically, the odds of the hazard) differs from the overall average hazard. Hence, we exponentiate these values to get them on the hazard ratio scale:
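hr <- exp(b_t) # hazard ratio scale; hr is reused in the plots below
hr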
## [1] 0.9098559 0.9418350 0.9669703 0.9945251 1.0003768 1.0111837 1.0208218
## [8] 1.0518335 1.0358645 1.0268974 1.0224928 1.0197312
Finally, we may visualize the corresponding estimated step heights in our step function (on the hazard ratio scale) as follows:
par(cex=0.66,mgp=c(1.75,0.5,0),mai=c(0.4,0.4,0.5,0.1))
# set up the plotting window:
plot(NA,NA,xlim=range(c_k),ylim=range(hr),log="y",
xlab="Time at Risk (weeks)",ylab="Relative Discrete Hazard",
main="Time Effect on Hazard of Remission Cessation")
# add each of the K=12 steps:
for (k in seq_len(length(c_k)-1)) {
lines(c(c_k[k],c_k[k+1]),rep(hr[k],2))
}
We can see that the hazard of remission cessation increases steadily until it peaks just after 20 weeks, leveling off thereafter.
For our initial model fitting above, we used a step function approximation of the functional relationship between discrete time \(t\) and the hazard of event occurrence with a total of \(K=12\) steps, all with an equal width of 3 weeks. Here, we will compare this formulation to coarser and finer ones (i.e., using fewer or more steps). In particular, we will now obtain inferences from models with step widths of 6 weeks (6 steps total), 2 weeks (18 steps), and 1 week (36 steps), and we will compare the results to the model with a step width of 3 weeks (12 steps) used earlier.
c_k6 <- seq(0,36,6) # 6-week apart step boundaries
x_brea[,1] <- cut(x[,2],c_k6,labels=FALSE) # grouped time t
set.seed(1234)
fit6 <- brea_mcmc(x_brea,y,priors,10000,1000)
c_k2 <- seq(0,36,2) # 2-week apart step boundaries
x_brea[,1] <- cut(x[,2],c_k2,labels=FALSE) # grouped time t
set.seed(1234)
fit2 <- brea_mcmc(x_brea,y,priors,10000,1000)
c_k1 <- seq(0,36,1) # 1-week apart step boundaries
x_brea[,1] <- cut(x[,2],c_k1,labels=FALSE) # grouped time t
set.seed(1234)
fit1 <- brea_mcmc(x_brea,y,priors,10000,1000)
Note that for the last formulation (36 steps, each of length just 1 week), we could have simply used the original time variable x[,2], since this discrete time variable already consisted of positive integer values (and hence no grouping/discretization occurred).
As before, we will calculate posterior medians of the time effect in each case on the hazard ratio scale:
hr6 <- exp(apply(fit6$b_m_s[[1]][1,,],1,median))
hr2 <- exp(apply(fit2$b_m_s[[1]][1,,],1,median))
hr1 <- exp(apply(fit1$b_m_s[[1]][1,,],1,median))
We may then plot the step functions for all four choices of step width on the same graph:
par(cex=0.66,mgp=c(1.75,0.5,0),mai=c(0.4,0.4,0.5,0.1))
# set up the plotting window:
plot(NA,NA,xlim=range(c_k),ylim=range(c(hr,hr6,hr2,hr1)),log="y",
xlab="Time at Risk (weeks)",ylab="Relative Discrete Hazard",
main="Time Effect on Hazard of Remission Cessation")
for (k in seq_len(length(c_k6)-1)) {
lines(c(c_k6[k],c_k6[k+1]),rep(hr6[k],2),col="red")
}
for (k in seq_len(length(c_k)-1)) {
lines(c(c_k[k],c_k[k+1]),rep(hr[k],2))
}
for (k in seq_len(length(c_k2)-1)) {
lines(c(c_k2[k],c_k2[k+1]),rep(hr2[k],2),col="green")
}
for (k in seq_len(length(c_k1)-1)) {
lines(c(c_k1[k],c_k1[k+1]),rep(hr1[k],2),col="blue")
}
legend("topleft",
legend=c("6-week Steps","3-week Steps","2-week Steps","1-week Steps"),
col=c("red","black","green","blue"),lty=1)
We can see immediately that the coarsest formulation (6-week steps, in red) does not capture the apparent magnitude of the time effect variation across the 36 study weeks. The 3-week (black) and 2-week (green) step formulations do a better job of capturing the time effect variation, but still do not have adequate granularity to capture either the peak in the hazard rate around 22–23 weeks or the greatly diminished hazard of remission cessation at the very beginning of the study.
We will now compare the efficiency of the block Metropolis-Hastings algorithm (selecting the block_mh = TRUE option of the brea_mcmc function) to the random walk Metropolis algorithm (block_mh = FALSE). In particular, we will examine the efficiency for inferences about the time effect parameters, since we expect efficiency for such parameters to be especially problematic given the high posterior correlation induced by the use of the GMRF prior.
For our comparison, we will use the original 3-week step length formulation; results for the other formulations are either similar or show even greater efficiency improvement for the Metropolis-Hastings version. We have already obtained inferences from the 3-week step model using Metropolis-Hastings, since block_mh = TRUE is the default option for the brea_mcmc function. We now obtain inferences using random walk Metropolis for comparison purposes:
c_k <- seq(0,36,3) # step boundaries
x_brea[,1] <- cut(x[,2],c_k,labels=FALSE) # grouped time t
set.seed(1234)
fit_rwm <- brea_mcmc(x_brea,y,priors,10000,1000,block_mh = FALSE)
Next, we need to calculate the MCMC efficiency of our posterior samples for each of the \(K=12\) parameters representing the time effect. Efficiency is defined as the percentage of the information contained in an independent sample of the same size that is contained in our MCMC sample. For example, if we have an MCMC sample of size \(S=1{,}000\), and this sample contains one tenth of the amount of information contained in an independent sample of size 1,000, then the MCMC efficiency is 10%. Similarly, we define the effective sample size of an MCMC sample to be the size of an independent sample that contains the same amount of information as our MCMC sample. So, for example, if an MCMC sample of size 1,000 contains the same amount of information as an independent sample of size 100, then the effective sample size is 100. These two metrics are related: the MCMC efficiency equals the effective sample size divided by the size \(S\) of the MCMC sample.
To calculate the MCMC efficiency for our vector of 12 time effect parameters, we will use the effectiveSize function from the coda package to calculate effective sample sizes:
# make the effectiveSize function from the coda package available:
library(coda)
# MCMC sample size:
S <- 10000
# MCMC efficiency for the default Metropolis-Hastings algorithm:
eff_mh <- apply(fit$b_m_s[[1]][1,,],1,effectiveSize)/S
# MCMC efficiency for the random walk Metropolis algorithm:
eff_rwm <- apply(fit_rwm$b_m_s[[1]][1,,],1,effectiveSize)/S
We may then examine the computed efficiencies as well as the factor by which efficiency improved when moving from random walk Metropolis to Metropolis-Hastings:
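eff_rwm # per-parameter efficiencies, random walk Metropolis
eff_mh # per-parameter efficiencies, block Metropolis-Hastings
summary(eff_rwm)
summary(eff_mh)
eff_mh/eff_rwm # factor of efficiency improvement
summary(eff_mh/eff_rwm)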
## [1] 0.01692673 0.01494314 0.01708034 0.02282697 0.03759462 0.05415988
## [7] 0.06363047 0.03311775 0.02149289 0.01689545 0.01441453 0.01943719
## [1] 0.3277317 0.4817151 0.5129773 0.4883707 0.6315530 0.5639637 0.6776411
## [8] 0.3099692 0.5367384 0.5119067 0.3846150 0.4203444
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01441 0.01692 0.02047 0.02771 0.03424 0.06363
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3100 0.4114 0.5001 0.4873 0.5435 0.6776
## [1] 19.361788 32.236545 30.033201 21.394456 16.799026 10.412942 10.649632
## [8] 9.359609 24.972835 30.298498 26.682456 21.625774
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.36 15.26 21.51 21.15 27.52 32.24
We can see that the typical efficiency (across the \(K=12\) parameters jointly representing the effect of time \(t\)) of the random walk Metropolis algorithm is around 2%, whereas the typical efficiency of the block Metropolis-Hastings algorithm is around 50%. In addition, the factor of efficiency improvement is typically around \(20x\) and ranges between \(10x\) and \(30x\) across the 12 parameters. We may also visualize these improvement levels:
par(cex=0.66,mgp=c(1.75,0.5,0),mai=c(0.4,0.4,0.5,0.1))
K <- 12
plot(NA,NA,xlim=c(1,K),ylim=c(0,0.7),xlab="Parameter Number",
ylab="MCMC Efficiency",xaxt="n")
axis(1,1:K,1:K)
abline(h=0)
lines(1:K,eff_rwm,col="red",type="b",pch=20)
lines(1:K,eff_mh,col="blue",type="b",pch=20)
legend("topleft",
legend=c("Block Metropolis-Hastings","Random Walk Metropolis"),
col=c("blue","red"),lty=1,pch=20)
The efficiency of each of the respective algorithms depends on the amount of data available to estimate each individual parameter as well as how much the data indicate the hazard is changing between steps. More specifically, random walk Metropolis performs worst when there is little data for each step, since this leads to higher posterior correlation between model parameters. In contrast, the block Metropolis-Hastings algorithm’s performance is more consistent, but tends to be worst when the data for a certain step height “conflicts” with the prior (e.g., the data is pulling the \(8^\text{th}\) step height in our example higher, while the prior is pulling this step’s height lower from both sides). This is likely due to inaccuracy in the approximations used to propose parameter updates for the block Metropolis-Hastings algorithm in such cases.
Finally, we note that updating the parameters for a particular covariate using the block Metropolis-Hastings algorithm takes about three times as much computation time per sampled value as when using random walk Metropolis. This means the computation time per MCMC iteration can be up to three times longer (usually this time penalty is smaller, though, since only covariates with the gmrf prior type use the block Metropolis-Hastings algorithm). Still, with per-iteration MCMC efficiency increasing by a factor of 20 or more, overall real-time efficiency is still often improved by a factor of 10 or more.
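Whether the per-iteration efficiency gain outweighs this extra per-iteration cost in a particular application is easy to check directly; a minimal sketch using base R's system.time (a short run suffices for timing, and results will vary by machine, model, and data set):

# rough wall clock comparison of the two algorithms on a short run
system.time(brea_mcmc(x_brea, y, priors, 1000, 100, block_mh = TRUE))
system.time(brea_mcmc(x_brea, y, priors, 1000, 100, block_mh = FALSE))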