The merlin
package allows the fitting of multi-outcome models and with any number of nested random effects, outcomes can be modelled jointly, or with shared random effects.
For n record to be included in the model there must be at least one observation per level in the model. For example in a joint survival and longitudinal biomarker model, if the survival time for a patient is available, but all longitudinal biomarker observations are missing, the individual will be excluded from the analysis.
For each individual outcome in the model, a complete case analysis is used. The response, covariates and level indicators are required to fit the model, which may vary by outcome. For example if a model simultaneously models two separate biomarkers, which are sampled at different time points, data in each record will only be included in the model for the appropriate outcome, potentially resulting in different sample sizes being used for each outcome.
There is a separate variance-covariance matrix for each level of random effects. The structure of the variance-covariance matrix can be varied to allow for correlation between parameters at each level. The possible structures are identity
, where all diagonal elements are estimated and constrained to be the same, diagonal
where all diagonal elements are estimated uniquely (the default), unstructured
where all elements of the variance-covariance matrix are estimated, and exchangeable
which assumes a common variance and a common covariance.
Predictions using the fixed-effects only can be found using the predict()
function with the argument type = fixedonly
. Marginal predictions, where the random effects are integrated out, can be obtained using type = marginal
.
In order to illustrate the potential uses of merlin
a number of increasingly advanced models have been fitted to the commonly used (in the area of joint modelling of longitudinal and survival data) primary biliary cirrhosis dataset.
This data set needs some re-formatting in order to fit joint models
library(merlin)
data("pbc")
pbc[1:11,c("id","years","status","status2","drug","serBilir","prothrombin","year")]
#> id years status status2 drug serBilir prothrombin year
#> 1 1 1.09517 dead 1 D-penicil 14.5 12.2 0.0000000
#> 2 1 1.09517 dead 1 D-penicil 21.3 11.2 0.5256817
#> 3 2 14.15234 alive 0 D-penicil 1.1 10.6 0.0000000
#> 4 2 14.15234 alive 0 D-penicil 0.8 11.0 0.4983025
#> 5 2 14.15234 alive 0 D-penicil 1.0 11.6 0.9993429
#> 6 2 14.15234 alive 0 D-penicil 1.9 10.6 2.1027270
#> 7 2 14.15234 alive 0 D-penicil 2.6 11.3 4.9008871
#> 8 2 14.15234 alive 0 D-penicil 3.6 11.5 5.8892783
#> 9 2 14.15234 alive 0 D-penicil 4.2 11.5 6.8858833
#> 10 2 14.15234 alive 0 D-penicil 3.6 11.5 7.8907020
#> 11 2 14.15234 alive 0 D-penicil 4.6 11.5 8.8325485
The event times are given in the years
variable and the event indicator in the status
variable which can take the values alive, dead or transplanted or the status2
variable which is 0 for alive and 1 for dead. Each survival observation should only appear once for each individual, otherwise each observation will be treated as a separate event, this allows for recurrent events to be modelled. The survival observations can be reformatted as follows:
pbc$stime <- pbc$years
pbc$stime[duplicated(pbc$id)] <- NA
pbc$died <- pbc$status2
pbc$died[duplicated(pbc$id)] <- NA
We are also going to work with the log of biomarkers prothrombin index and serum bilirubin throughout, the time of this measurements will be recorded in the variable time. We will also change the treatment variable to be numeric rather than a factor variable.
pbc$logb <- log(pbc$serBilir)
pbc$logp <- log(pbc$prothrombin)
pbc$time <- pbc$year
pbc$trt <- as.numeric(pbc$drug) - 1
The data now looks like this. Failing to set the data up in this way will lead to errors in the parameter estimates.
pbc[1:11,c("id","stime","died","logb","logp","time")]
#> id stime died logb logp time
#> 1 1 1.09517 1 2.67414865 2.501436 0.0000000
#> 2 1 NA NA 3.05870707 2.415914 0.5256817
#> 3 2 14.15234 0 0.09531018 2.360854 0.0000000
#> 4 2 NA NA -0.22314355 2.397895 0.4983025
#> 5 2 NA NA 0.00000000 2.451005 0.9993429
#> 6 2 NA NA 0.64185389 2.360854 2.1027270
#> 7 2 NA NA 0.95551145 2.424803 4.9008871
#> 8 2 NA NA 1.28093385 2.442347 5.8892783
#> 9 2 NA NA 1.43508453 2.442347 6.8858833
#> 10 2 NA NA 1.28093385 2.442347 7.8907020
#> 11 2 NA NA 1.52605630 2.442347 8.8325485
We can fit a simple linear model
lin.model <- merlin(logb ~ time, family = "gaussian", data = pbc)
lin.model
#> Merlin: mixed-effects model
#> Data: pbc
#>
#> Coefficients:
#> time _cons log_sd(resid.)
#> 0.01392 0.55956 0.10354
summary(lin.model)
#> Mixed effects regression model
#> Log likelihood = -2961.414
#>
#> Estimate Std. Error z Pr(>|z|) [95% Conf. Interval]
#> time 0.013916 0.008128 1.712 0.08687 -0.002014 0.029846
#> _cons 0.559560 0.035806 15.628 0.00000 0.489382 0.629739
#> log_sd(resid.) 0.103540 0.016032 6.458 0.00000 0.072119 0.134962
We can add flexibility to the model using restricted cubic splines:
rcs.model <- merlin(logb ~ rcs(time, df = 3), family = "gaussian", timevar = "time", data = pbc)
summary(rcs.model)
#> Mixed effects regression model
#> Log likelihood = -2960.733
#>
#> Estimate Std. Error z Pr(>|z|) [95% Conf. Interval]
#> rcs():1 0.043260 0.025147 1.720 0.0854 -0.006028 0.092548
#> rcs():2 0.028845 0.025148 1.147 0.2514 -0.020443 0.078133
#> rcs():3 -0.009251 0.025147 -0.368 0.7130 -0.058539 0.040037
#> _cons 0.603291 0.025147 23.990 0.0000 0.554003 0.652579
#> log_sd(resid.) 0.103511 0.016037 6.455 0.0000 0.072079 0.134943
By default, the restricted cubic splines are orthogonalised (orthog = TRUE
). The serum bilirubin observations are clustered within individuals, so we can add a normally-distributed random intercept term named M1
. The coefficient of the random-effect term will normally constrained to 1 using the *1
notation, unless the random effect is being shared with another outcome model:
r.int.model <- merlin(logb ~ rcs(time, df = 3) + M1[id]*1,
family = "gaussian",
levels = "id",
timevar = "time",
data = pbc)
summary(r.int.model)
#> Mixed effects regression model
#> Log likelihood = -1978.61
#>
#> Estimate Std. Error z Pr(>|z|) [95% Conf. Interval]
#> rcs():1 0.266362 0.013699 19.444 0.00000 0.239513 0.293211
#> rcs():2 0.058592 0.013014 4.502 0.00001 0.033086 0.084098
#> rcs():3 0.016696 0.012695 1.315 0.18845 -0.008186 0.041578
#> _cons 0.614076 0.023175 26.498 0.00000 0.568655 0.659498
#> log_sd(resid.) -0.611477 0.016776 -36.450 0.00000 -0.644357 -0.578597
#> log_sd(M1) -0.127962 0.019616 -6.523 0.00000 -0.166409 -0.089515
#>
#> Integration method: Non-adaptive Gauss-Hermite quadrature
#> Integration points: 7
We can also add an additional random-slope term (M2
) to the model by forming an interaction between the time variable and random-effect using :
. We can increase the number of quadrature nodes to improve estimation of the likelihood using the ip
option within the control
argument.
r.slope.model <- merlin(logb ~ rcs(time, df = 3) + M1[id]*1 + time:M2[id]*1,
family = "gaussian",
timevar = "time",
levels = "id",
data = pbc,
control = list(ip = 15))
summary(r.slope.model)
#> Mixed effects regression model
#> Log likelihood = -1718.679
#>
#> Estimate Std. Error z Pr(>|z|) [95% Conf. Interval]
#> rcs():1 -0.024415 0.016517 -1.478 0.13935 -0.056787 0.007957
#> rcs():2 -0.020279 0.010947 -1.852 0.06396 -0.041736 0.001177
#> rcs():3 -0.025302 0.009456 -2.676 0.00745 -0.043835 -0.006769
#> _cons 0.234964 0.019792 11.872 0.00000 0.196172 0.273756
#> log_sd(resid.) -1.017033 0.017687 -57.502 0.00000 -1.051698 -0.982367
#> log_sd(M1) -0.276734 0.021426 -12.916 0.00000 -0.318728 -0.234740
#> log_sd(M2) -1.432000 0.023614 -60.642 0.00000 -1.478283 -1.385717
#>
#> Integration method: Non-adaptive Gauss-Hermite quadrature
#> Integration points: 15
If a model has random effects, merlin
will fit fixed effects models to obtain starting values and then assume an identity matrix for the variance of the random effects. Initial values can be manually set using the from
option.