---
title: "Cross-validation with the landmaRk package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Cross-validation with the landmaRk package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


## Setup

In addition to the `landmaRk` package, we will also use `tidyverse`. 

```{r setup}
set.seed(123)
library(landmaRk)
library(tidyverse)
library(survival)
library(prodlim)
```

## Example: `aids` data

As in the first vignette, we use the epileptic dataset to perform landmarking
analysis  of time-to-event data with time-varying covariates. Here is the
structure of  the dataset.

```{r}
library(JMbayes2)
data("aids")
aids$patient <- as.numeric(aids$patient)
str(aids)
```


## Initialising the landmarking analysis for cross-validation

As in the introductory vignette, the dataset `epileptic` comes in wide format.
We split it into two dataframes, one for static covariates and one for dynamic
covariates.

```{r}
# DF with Static covariates
aids_dfs <- split_wide_df(
  aids,
  ids = "patient",
  times = "obstime",
  static = c("Time", "death", "drug", "gender", "prevOI"),
  dynamic = c("CD4"),
  measurement_name = "value"
)
static <- aids_dfs$df_static
head(static)
```

```{r}
# DF with Dynamic covariates
dynamic <- aids_dfs$df_dynamic
head(dynamic[["CD4"]])
```


As in the introductory vignette, we create an object of class `LandmarkAnalysis`,
using the helper function of the same name. We now use the optional argument `K`
to specify the number of cross-validations folds. In this example, we request 
five cross validation folds.

We then calculate the risk sets using `compute_risk_sets()`. 
```{r}
landmarking_object <- LandmarkAnalysis(
  data_static = static,
  data_dynamic = dynamic,
  event_indicator = "death",
  ids = "patient",
  event_time = "Time",
  times = "obstime",
  measurements = "value",
  K = 5
)

landmarking_object <- landmarking_object |>
  compute_risk_sets(landmarks = c(6, 8))
```

## Performance evaluation in hold-out dataset

Now that we have split the dataset into `K=5` parts for cross-validation,
we can use one of the five parts as test set and the remaining four parts as
the training set. To do so, use the argument `validation_fold` to indicate the
that you  want to use as test set when calling `fit_longitudinal()`,
`predict_longitudinal()`, `fit_survival()` and `predict_survival()`. 


```{r}
landmarking_object <- landmarking_object |>
  fit_longitudinal(
    landmarks = c(6, 8),
    method = "lme4",
    formula = value ~ prevOI + obstime + (obstime | patient),
    dynamic_covariates = c("CD4"),
    validation_fold = 5
  ) |>
  predict_longitudinal(
    landmarks = c(6, 8),
    method = "lme4",
    dynamic_covariates = c("CD4"),
    validation_fold = 5,
  ) |>
  fit_survival(
    formula = Surv(event_time, event_status) ~ drug,
    landmarks = c(6, 8),
    horizons = 12 + c(6, 8),
    method = "coxph",
    dynamic_covariates = c("CD4"),
    validation_fold = 5
  ) |>
  predict_survival(
    landmarks = c(6, 8),
    horizons = 12 + c(6, 8),
    validation_fold = 5
  )
```

We can also use `summary()` to print parameter estimates. Note that
this time the sample size is smaller, because we have held out part of the
original dataset for validation. 


```{r}
summary(landmarking_object,
        type = "longitudinal",
        landmark = 6,
        dynamic_covariate = "CD4")
```

```{r}
summary(landmarking_object, type = "survival", landmark = 6, horizon = 18)
```

Here are the in-sample performance metrics: 

```{r}
performance_metrics(
  landmarking_object,
  landmarks = c(6, 8),
  horizons = c(18, 20),
  auc_t = TRUE, c_index = FALSE,
  h_times = c(3, 6, 12)
)
```

Out-of-sample performance metrics can be obtained by specifying `train = FALSE`:

```{r}
performance_metrics(
  landmarking_object,
  landmarks = c(6, 8),
  horizons = c(18, 20),
  auc_t = TRUE, c_index = FALSE,
  h_times = c(3, 6, 12),
  train = FALSE
)
```

## Cross-validation

Now, we can embed the above pipeline in a for loop to perform cross-validation:

```{r}
landmarking_object <- LandmarkAnalysis(
  data_static = static,
  data_dynamic = dynamic,
  event_indicator = "death",
  ids = "patient",
  event_time = "Time",
  times = "obstime",
  measurements = "value",
  K = 5
)

landmarking_object <- landmarking_object |>
  compute_risk_sets(landmarks = c(6, 8))
```

```{r}
metrics <- list()
for (k in 1:5) {
  metrics[[k]] <- landmarking_object |>
    fit_longitudinal(
      landmarks = c(6, 8),
      method = "lme4",
      formula = value ~ prevOI + obstime + (obstime | patient),
      dynamic_covariates = c("CD4"),
      validation_fold = k
    ) |>
  predict_longitudinal(
    landmarks = c(6, 8),
    method = "lme4",
    dynamic_covariates = c("CD4"),
    validation_fold = k,
    allow.new.levels = TRUE
  ) |>
  fit_survival(
    formula = Surv(event_time, event_status) ~ drug,
    landmarks = c(6, 8),
    horizons = 12 + c(6, 8),
    method = "coxph",
    dynamic_covariates = c("CD4"),
    validation_fold = k
  ) |>
  predict_survival(
    landmarks = c(6, 8),
    horizons = 12 + c(6, 8),
    validation_fold = k
  ) |>
    performance_metrics(
      landmarks = c(6, 8),
      horizons = c(18, 20),
      auc_t = TRUE, brier = TRUE, c_index = FALSE,
      h_times = c(3, 6, 12)
    )
}

metrics
```