Welcome to this guide on how to work with ctsmTMB
for
modelling time-series data.
In this example we consider a simple 1D mean-reverting stochastic differential equation model, the Ornstein-Uhlenbeck process: \[ \mathrm{d}x_{t} = \theta (\mu - x_{t}) \, \mathrm{d}t \, + \sigma_{x} \, \mathrm{d}b_{t} \] Here \(\theta\), \(\mu\) and \(\sigma_x\) are (fixed effects) parameters to be estimated.
We assume that we have direct observations of the state \(x_{t}\), at discrete times \(t_k\) for \(k=0,1,2,...,N\) i.e. the observation equation is \[ y_{t_{k}} = x_{t_{k}} + \varepsilon_{t_{k}} \]
The residuals are assumed to be Gaussian, and we choose the variance as \[ \varepsilon_{t_{k}} \sim N(0,\sigma_{y}^{2} \cdot u_{t_{k}}) \]
Here \(\sigma_{y}\) is a fixed effect parameter (also to be estimated), and \(u_{t}\) is a known time-dependent input. The input is added here for sake of introduction, just to demonstrate how time-dependent inputs are specified. We could for example imagine the input to be given by \[ u_{t_{k}} = \left\{ 1, 4, 1, 4, 1, 4 \cdots \right\} \] such that certain observations have twice the standard deviation of others, and thus carry less weight (information) about the latent state \(x\). In the remainder of this vignette however, we simply set \(u_{t} = 1\).
We initialise a ctsmTMB
model object using
Printing the model object in the console reveals some basic information about it:
## ctsmTMB model object:
## States 0
## Diffusions 0
## Observations 0
## Inputs 0
## Parameters 0
First we specify the stochastic differential equation governing our latent state \(x_t\). This is straight-forward using R formulas, choosing appropriate character names for the parameters.
We emphasize that drift terms must be multiplied by a dt
and diffusion terms by a dw
or dw#
where # is
some number e.g. dw2
. A single state equation may contain
any number of diffusion terms i.e.
Now the observation equations of the system must be specified. This amounts to specifying the mean and variance of the residual normal distribution. First the mean is specified i.e.
The variable used on the left-hand side, here y
,
determines the name of the relevant observations to-be provided in the
data for maximum likelihood estimation later.
Next we specify the residual variance, using the name given to the
observation equation above, in addObs
:
Next, we declare which variables are time-dependent inputs via
These shall must also be provided in the data later, similar to observations.
We must also specify the (fixed effects) parameters, and in addition their initial/lower/upper values for the estimation.
model$setParameter(
theta = c(initial = 5, lower = 0, upper = 20),
mu = c(initial = 0, lower = -10, upper = 10),
sigma_x = c(initial = 1e-1, lower = 1e-5, upper = 5),
sigma_y = c(initial = 1e-1, lower = 1e-5, upper = 5)
)
A parameter can be fixed by supplying just a single value. It is for instance typically useful to fix observation noise parameters because simultaneous identification of observation and process noise is difficult in practice, and some knowledge about observation noise may be known. Thus, let us fix \(\sigma_{y}\) to a somewhat arbitrary but reasonable value:
Let’s inspect the model object again, and see that it is no longer empty:
## ctsmTMB model object:
## States 1
## Diffusions 1
## Observations 1
## Inputs 1
## Parameters 4
##
## System Equations:
##
## dx ~ theta * (mu - x) * dt + sigma_x * dw
##
## Observation Equations:
##
## y: y ~ x + e e ~ N(0, sigma_y^2 * u)
##
##
## Free Parameters:
## theta, mu, sigma_x
##
## Fixed Parameters:
## sigma_y
Lastly the state distribution at the initial time point must be specified via its mean and variance.
Note that in higher dimensions the provided covariance
cov
must be a matrix. A simple initial covariance could
just be a scaled identity e.g. in two dimensions
cov = 1e-1 * diag(2)
The model has now been fully specified, and state and parameter estimation can be performed on the data at hand.
In this particular example we generate some fake data. This can be achieved by simulating a stochastic realization of the stochastic differential equation, and adding some observation noise to it.
We achieve this task using the simulate
method of the
model object, which perform stochastic simulations based on the Euler-Maruyama
scheme. The user is referred to the simulation
vignette for further details.
We choose the true parameters to be \[ \theta = 10.0 \qquad \mu=1.00 \qquad \sigma_{x} = 1.00 \qquad \sigma_{y}=0.05 \]
The code below performs the simulation and prepares data for likelihood estimation:
library(ggplot2)
# Set simulation settings
set.seed(11)
true.pars <- c(theta=10, mu=1, sigma_x=1, sigma_y=0.05)
dt.sim <- 1e-3
t.end <- 1
t.sim <- seq(0, t.end, by=dt.sim)
df.sim <- data.frame(t=t.sim, u=1, y=NA)
# Perform simulation
sim <- model$simulate(data=df.sim,
pars=true.pars,
n.sims=1,
silent=T,
initial.state=initial.state)
x <- sim$states$x$i0$x1
# Extract observations from simulation and add noise
iobs <- seq(1,length(t.sim), by=10)
t.obs <- t.sim[iobs]
y = x[iobs] + true.pars["sigma_y"] * rnorm(length(iobs))
# Create data-frame
data <- data.frame(
t = t.obs,
u = 1,
y = y
)
# Plot the simulation and observed data
ggplot() +
geom_line(aes(x=t.sim,y=x,color="Simulation")) +
geom_point(aes(x=t.obs,y=y,fill="Observations")) +
ctsmTMB:::getggplot2theme() + labs(x="t", y="x",color="",fill="")
We can now pass the data
to the estimate
method. This will build the model, perform various checks, construct the
computational graph for automatic differentiation, and then perform the
optimization.
Note: The data must contain an
increasing time column named t
and columns for each of the
specified inputs and observations, in this case u
and
y
.
## Checking data...
## Constructing objective function and derivative tables...
## ...took: 0.052 seconds.
## Minimizing the negative log-likelihood...
## 0: 1400.0171: 5.00000 0.00000 0.100000
## 1: -64.612226: 4.98142 0.0913549 1.09565
## 2: -74.658900: 4.79393 1.14526 1.00122
## 3: -74.682470: 4.52045 1.00665 1.00785
## 4: -74.734734: 4.66161 1.06392 1.00501
## 5: -74.745518: 4.73877 1.07682 1.00450
## 6: -74.779976: 5.00284 1.09713 1.00396
## 7: -74.860990: 5.64276 1.11379 1.00408
## 8: -75.070516: 7.18786 1.10429 1.00635
## 9: -75.265661: 11.3571 1.00526 1.01491
## 10: -75.356843: 9.53201 1.05607 1.00844
## 11: -75.394350: 9.92466 1.04390 1.00746
## 12: -75.453133: 10.4866 1.02483 1.00122
## 13: -75.527071: 10.6227 1.01657 0.989961
## 14: -75.768819: 10.2550 1.00728 0.936849
## 15: -75.842025: 9.50002 1.01958 0.912180
## 16: -75.851608: 9.27045 1.02899 0.911285
## 17: -75.852200: 9.26599 1.03199 0.912005
## 18: -75.852208: 9.27461 1.03231 0.912029
## 19: -75.852209: 9.27737 1.03232 0.912025
## 20: -75.852209: 9.27778 1.03232 0.912024
## Optimization finished!:
## Elapsed time: 0.004 seconds.
## The objective value is: -7.585221e+01
## The maximum gradient component is: 5.4e-05
## The convergence message is: relative convergence (4)
## Iterations: 20
## Evaluations: Fun: 22 Grad: 21
## See stats::nlminb for available tolerance/control arguments.
## Returning results...
## Finished!
Note: a time-consuming step in the estimation procedure is construction of the AD graph of the underlying likelihood function, but the time spent for this task relative to the optimization time will even out for models with more parameters.
The output generated during the optimization is the objective (negative log-likelihood) value and the parameter values at the current step. The optimizer used by ctsmTMB is the nlminb optimizer from the stats library.
We refer the user to the estimation
vignette for further details on the available arguments to
estimate
, and to the ‘use
another optimizer’ vignette for details on how to use another
optimizer than nlminb.
fit
objectThe fit
object contains the following entries
## [1] "convergence" "nll" "nll.gradient" "nll.hessian" "par.fixed"
## [6] "sd.fixed" "cov.fixed" "states" "residuals" "observations"
## [11] "tvalue" "Pr.tvalue" "private"
The first is a boolean which indicates whether estimation was successful
## [1] 0
which is just a copy of the optimization message from
stats::nlminb
.
The second is the likelihood value at the found optimum
## [1] -75.85221
the third is the likelihood gradient at the optimum
## theta mu sigma_x
## -2.315829e-07 5.350033e-05 7.049035e-06
and the fourth is the likelihood hessian at the optimum
## theta mu sigma_x
## theta 0.05211939 -0.05360169 -0.5703124
## mu -0.05360169 102.97419206 0.4686782
## sigma_x -0.57031238 0.46867821 114.4596547
Printing the fit
object reveals a standard coefficient
matrix for the parameter estimates.
## Coefficent Matrix
## Estimate Std. Error t value Pr(>|t|)
## theta 9.277779 4.505959 2.0590 0.04207 *
## mu 1.032320 0.098572 10.4728 < 2.2e-16 ***
## sigma_x 0.912024 0.096128 9.4876 1.207e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see the parameter estimate and the associated standard error together with a one-dimensional t-test statistic and associated P-value for the common hypothesis test \[ H_{0}: p = 0 \\ H_{1}: p \neq 0 \]
Note The large uncertainty in \(\theta\) here is primarily caused by a relatively short time-series (2 seconds), relative to the characteristic time of the process \(\tau = 1/\theta = 0.1 \, \text{sec}\).
The parameter-related information can be extracted from the fit object. The estimated (fixed) parameters:
## theta mu sigma_x
## 9.2777787 1.0323200 0.9120244
The standard deviations of the (fixed) parameters:
## theta mu sigma_x
## 4.50595858 0.09857172 0.09612767
The covariance of the (fixed) parameters:
## theta mu sigma_x
## theta 20.30366268 1.010851e-02 1.011246e-01
## mu 0.01010851 9.716385e-03 1.058147e-05
## sigma_x 0.10112465 1.058147e-05 9.240528e-03
The parameter covariance is found by inverting the likelihood hessian at the found optimum i.e.:
## theta mu sigma_x
## theta 20.30366268 1.010851e-02 1.011246e-01
## mu 0.01010851 9.716385e-03 1.058147e-05
## sigma_x 0.10112465 1.058147e-05 9.240528e-03
The optimal state distributions associated with the estimated parameters can be extracted from the model object as well.
In this example we used the Extended Kalman filter (this is the
default filtering algorithm specified by the argument
method="ekf"
to estimate
). This method
produces prior and posterior state estimates. The
prior estimates are one-step-ahead predictions, while the posterior
estimated are these priors but updated against the current-time
available observations. The user is referred to the Kalman
Filter vignette for more information on the theoretical details.
The states can easily be plotted using the provided S3 plot.ctsmTMB.fit method. Here we plot both prior and posterior states against the observations:
Note: The decrease in variance for the posterior state estimate is expected because these states are updated to the observations.
Model validation typically involves inspecting the properties of
prediction residuals. The model residuals are contained in
fit$residuals
with the entries:
## [1] "residuals" "sd" "normalized" "cov"
We can also easily generate a residual analysis plot with the
S3 plot.ctsmTMB.fit method. This is actually the
default arguments to plot.ctsmTMB.fit
. This produces a
time-series of the residuals, a histogram, a quantile-quantile plot,
auto/partial-correlations and a cumulative periodogram:
We can perform likelihood profiling with the profile
S3
method on the fit
object, and further plot the result by
calling the plot
S3 method on that.
For an example we can inspect the profile likelihood of \(\theta\) as follows:
a <- fit$par.fixed["theta"] - 5
b <- fit$par.fixed["theta"] + 5
prof <- profile(fit, list("theta"=seq(a,b,length.out=25)), silent=TRUE)
plot(prof)
The model definitions can be kept clean by defining algebraic expressions which replace variables in the defined equations. A typical scenario where algebraic equations can be used is to rename parameters which must be strictly positive.
Example In this model \(\theta\) should be a strictly positive parameter \(\hat{\theta} = \exp(\log\theta)\). This can be achieved by setting the following algebraic expression:
The effect of this is to replace all occurrences of
theta
in the model equations with
exp(logtheta)
. The final thing to do is to add a new
parameter entry to the model object which describes values for
logtheta