This tutorial explains how to check that the fitted model is compatible with the observed data. Basically, we predict the data we would expect based on the model posterior, and we compare it with the actually observed data to see if they are consistent.
Posterior predictive checks are important to ensure that the model makes sense to explain the observed data. Typically, before doing a posterior predictive check, one would check that the MCMC run was successful and that no problem arised during the MCMC itself (see the previous vignette about Post-run diagnostics).
library(isotracer)
library(tidyverse)
To illustrate posterior predictive checks, we will use the same simulated data as in the Quick Start tutorial:
<- tibble::tribble(
exp ~time.day, ~species, ~biomass, ~prop15N,
0, "algae", 1.02, 0.00384,
1, "algae", NA, 0.0534,
1.5, "algae", 0.951, NA,
2, "algae", 0.889, 0.0849,
2.5, "algae", NA, 0.0869,
3, "algae", 0.837, 0.0816,
0, "daphnia", 1.74, 0.00464,
1, "daphnia", NA, 0.00493,
1.5, "daphnia", 2.48, NA,
2, "daphnia", NA, 0.00831,
2.5, "daphnia", 2.25, NA,
3, "daphnia", 2.15, 0.0101,
0, "NH4", 0.208, 0.79,
1, "NH4", 0.227, NA,
1.5, "NH4", NA, 0.482,
2, "NH4", 0.256, 0.351,
2.5, "NH4", NA, 0.295,
3, "NH4", 0.27, NA
)
# Separate initial conditions and observations
<- exp %>% filter(time.day == 0)
inits <- exp %>% filter(time.day > 0)
obs
# Build the network model
<- new_networkModel() %>%
mod set_topo("NH4 -> algae -> daphnia -> NH4") %>%
set_init(inits, comp = "species", size = "biomass", prop = "prop15N") %>%
set_obs(obs, time = "time.day")
## Using default distribution family for proportions ("gamma_cv").
## (eta is the coefficient of variation of gamma distributions.)
## Using default distribution family for sizes ("normal_cv").
## (zeta is the coefficient of variation of normal distributions.)
## Using the same columns by default as the ones used with `set_init()`:
## comp = "species"
## size = "biomass"
## prop = "prop15N"
# Set reasonable but vague priors for this model
<- set_priors(mod, normal_p(0, 5)) mod
## Prior modified for parameter(s):
## - eta
## - lambda_algae
## - lambda_daphnia
## - lambda_NH4
## - upsilon_algae_to_daphnia
## - upsilon_daphnia_to_NH4
## - upsilon_NH4_to_algae
## - zeta
We run the MCMC sampler to fit the model:
<- run_mcmc(mod, iter = 1000)
fit plot(fit)
# Note: the figure below only shows a few of the traceplots for vignette concision
A posterior predictive check is the comparison between what the fitted model predicts and the actual observed data. The aim is to detect if the model is inadequate to describe the data.
The fastest and simplest way to perform a posterior predictive check for a network model is to predict the trajectories of the compartment sizes and proportions of marked material, and compare it with the data points. This is what we did in previous tutorials:
<- mod %>% predict(fit, probs = 0.95)
predictions plot(predictions, ylab.size = "Biomass (mg N)", ylab.prop = "Proportion 15N")
We would expect 95% of the observations to fall within the predicted 95% credible intervals if the model is appropriate to describe the dataset.
If you want a finer control of the posterior predictive checks, you
can obtain tidy posterior predictions with tidy_dpp()
. Tidy
posterior predictions are convenient if you want to perform some custom
checks or examine some predictions in particular. Additionally, the
output from tidy_dpp()
can be used with the functions
provided by the bayesplot package for posterior
predictive checks.
Posterior predictive checks rely on three objects:
y
: the observed data (a vector with N observations)y_rep
: the predicted data (a D×N matrix, where D is the number of posterior draws from
which the predictions are generated).vars
: a table of grouping variables with N rows which indicate how observations in
y
are structured.Given the complexity of network model data, which are time series for
sizes and proportions of different compartments, the
tidy_dpp()
does the heavy work of preparing tidy versions
of y
, y_rep
and vars
from the
network model and the corresponding MCMC run:
<- tidy_dpp(mod, fit)
z names(z)
# Note that z can be filtered based on the columns in z[["vars"]].
head(z[["vars"]])
<- z %>% filter(type == "size")
z_size <- z %>% filter(type == "proportion") z_prop
## [1] "y" "y_rep" "vars"
## # A tibble: 6 × 4
## compartment time type group.
## <chr> <dbl> <chr> <list>
## 1 algae 1 proportion <NULL>
## 2 algae 1.5 size <NULL>
## 3 algae 2 proportion <NULL>
## 4 algae 2 size <NULL>
## 5 algae 2.5 proportion <NULL>
## 6 algae 3 proportion <NULL>
The elements of z
, z_size
or
z_prop
can be used directly with bayesplot
functions, are illustrated below:
library(bayesplot)
# Distributions
ppc_dens_overlay(z_prop[["y"]], z_prop[["y_rep"]])
ppc_ecdf_overlay(z_size[["y"]], z_size[["y_rep"]])
# Intervals
ppc_intervals(z_prop[["y"]], z_prop[["y_rep"]],
x = z_prop[["vars"]][["time"]]) +
coord_trans(y = "log10") +
facet_grid(~ z_prop[["vars"]][["compartment"]])
# Scatterplot
ppc_scatter_avg(z_prop[["y"]], z_prop[["y_rep"]]) +
coord_trans(x = "log10", y = "log10")