This vignette describes the R interface to NeuralEstimators, a Julia package that facilitates a suite of neural methods for simulation-based inference. These methods are likelihood-free and amortised, in the sense that, once the neural networks are trained on simulated data, they enable rapid inference across arbitrarily many observed data sets in a fraction of the time required by conventional approaches. The package accommodates any model for which simulation is feasible by allowing users to define their model implicitly through simulated data.
Comprehensive documentation for the Julia version of the package is available here.
Note on math rendering: When viewing this vignette online, you may encounter issues with calligraphic math symbols rendering as “square boxes”. If this occurs, you can adjust your settings to resolve it by right-clicking on the square box and then selecting Math Settings > Math Renderer > Common HTML.
Here, we provide an overview of the amortised neural inferential methods supported by the package, which include neural Bayes estimators (Section 1.1), neural posterior estimators (Section 1.2), and neural ratio estimators (Section 1.3). For further details on each of these methods and amortised neural inference more broadly, see the review paper by Zammit-Mangion et al. (2025) and the references therein.
Notation: We denote model parameters of interest by \(\boldsymbol{\theta} \equiv (\theta_1, \dots, \theta_d)' \in \Theta\), where \(\Theta \subseteq \mathbb{R}^d\) is the parameter space. We denote data by \(\boldsymbol{Z} \equiv (Z_1, \dots, Z_n)' \in \mathcal{Z}\), where \(\mathcal{Z} \subseteq \mathbb{R}^n\) is the sample space. We denote neural-network parameters by \(\boldsymbol{\gamma}\). For simplicity, we assume that all measures admit densities with respect to the Lebesgue measure. We use \(\pi(\cdot)\) to denote the prior density function of the parameters. The input argument to a generic density function \(p(\cdot)\) serves to specify both the random variable associated with the density and its evaluation point.
The goal of parametric point estimation is to estimate \(\boldsymbol{\theta}\) from data \(\boldsymbol{Z}\) using an estimator, \(\hat{\boldsymbol{\theta}} : \mathcal{Z}\to\Theta\). Estimators can be constructed intuitively within a decision-theoretic framework based on average-risk optimality. Specifically, consider a loss function \(L: \Theta \times \Theta \to [0, \infty)\). Then the Bayes risk of the estimator \(\hat{\boldsymbol{\theta}}(\cdot)\) is \[ \int_\Theta \int_{\mathcal{Z}} L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}(\boldsymbol{Z}))p(\boldsymbol{Z} \mid \boldsymbol{\theta}) \pi(\boldsymbol{\theta}) \textrm{d} \boldsymbol{Z} \textrm{d}\boldsymbol{\theta}. \] Any minimiser of the Bayes risk is said to be a Bayes estimator with respect to \(L(\cdot, \cdot)\) and \(\pi(\cdot)\).
Bayes estimators are functionals of the posterior distribution (e.g., the Bayes estimator under quadratic loss is the posterior mean), and are therefore often unavailable in closed form. A way forward is to assume a flexible parametric function for \(\hat{\boldsymbol{\theta}}(\cdot)\), and to optimise the parameters within that function in order to approximate the Bayes estimator. Neural networks are ideal candidates, since they are universal function approximators and fast to evaluate. Let \(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}} : \mathcal{Z}\to\Theta\) denote a neural network parameterised by \(\boldsymbol{\gamma}\). Then a Bayes estimator may be approximated by \(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}^*}(\cdot)\), where \[ \boldsymbol{\gamma}^* \equiv \underset{\boldsymbol{\gamma}}{\mathrm{arg\,min\;}} \frac{1}{K} \sum_{k = 1}^K L(\boldsymbol{\theta}^{(k)}, \hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(k)})), \] with \(\boldsymbol{\theta}^{(k)} \sim \pi(\boldsymbol{\theta})\) and, independently for each \(k\), \(\boldsymbol{Z}^{(k)} \sim p(\boldsymbol{Z} \mid \boldsymbol{\theta}^{(k)})\). The process of obtaining \(\boldsymbol{\gamma}^*\) is referred to as “training the network”, and this can be performed efficiently using back-propagation and stochastic gradient descent. The trained neural network \(\hat{\boldsymbol{\theta}}_{\boldsymbol{\gamma}^*}(\cdot)\) approximately minimises the Bayes risk, and is therefore called a neural Bayes estimator (Sainsbury-Dale et al., 2024).
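To make the Monte Carlo approximation of the Bayes risk concrete, the following base-R sketch (a toy illustration, not part of the package workflow) evaluates the approximate Bayes risk of two simple fixed estimators, the sample mean and the sample median, under the model \(\mu \sim \rm{Gau}(0, 1)\) and \(Z_i \mid \mu \sim \rm{Gau}(\mu, 1)\) with absolute-error loss; a neural Bayes estimator simply replaces the fixed estimator with a neural network whose parameters are optimised against this criterion.
# Monte Carlo approximation of the Bayes risk for two fixed (non-neural) estimators
# Toy model: mu ~ Gau(0, 1); Z_1, ..., Z_m | mu iid Gau(mu, 1); absolute-error loss
K <- 1e5                                   # number of (parameter, data) pairs
m <- 10                                    # number of replicates per data set
mu <- rnorm(K)                             # mu^(k) drawn from the prior
Z <- matrix(rnorm(K * m, mean = rep(mu, each = m)), nrow = m)  # Z^(k) | mu^(k), one data set per column
risk_mean   <- mean(abs(colMeans(Z) - mu))           # approximate Bayes risk of the sample mean
risk_median <- mean(abs(apply(Z, 2, median) - mu))   # approximate Bayes risk of the sample median
c(mean = risk_mean, median = risk_median)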
Once trained, a neural Bayes estimator can be applied repeatedly to observed data (whose structure conforms with the chosen neural-network architecture) at a fraction of the computational cost of conventional inferential methods. It is therefore ideal to use a neural Bayes estimator in settings where inference needs to be made repeatedly; in this case, the initial training cost is said to be amortised over time.
Uncertainty quantification with neural Bayes estimators often proceeds through the bootstrap distribution (e.g., Lenzi et al., 2023; Richards et al., 2024; Sainsbury-Dale et al., 2024). Bootstrap-based approaches are particularly attractive when nonparametric bootstrap is possible (e.g., when the data are independent replicates), or when simulation from the fitted model is fast, in which case parametric bootstrap is also computationally efficient. However, these conditions are not always met and, although bootstrap-based approaches are often considered to be fairly accurate and preferable to methods based on asymptotic normality, there are situations where bootstrap procedures are not reliable (see, e.g., Canty et al., 2006, p. 6).
Alternatively, by leveraging ideas from (Bayesian) quantile regression, one may construct a neural Bayes estimator that approximates a set of marginal posterior quantiles (Fisher et al., 2023; Sainsbury-Dale et al., 2025), which can then be used to construct credible intervals for each parameter. Inference then remains fully amortised since, once the estimators are trained, both point estimates and credible intervals can be obtained with virtually zero computational cost. Specifically, posterior quantiles can be targeted by training a neural Bayes estimator under the loss function \[ L(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}; \tau) \equiv \sum_{j=1}^d (\hat{\theta}_j - \theta_j)\{\mathbb{I}(\hat{\theta}_j > \theta_j) - \tau\}, \quad 0 < \tau < 1, \] where \(\mathbb{I}(\cdot)\) denotes the indicator function, since the Bayes estimator under this loss function is the vector of marginal posterior \(\tau\)-quantiles (Sainsbury-Dale et al., 2025, Sec. 2.2.4).
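As a quick sanity check of this property, the following base-R snippet (a toy illustration, independent of the package) minimises the sample average of the loss above for a scalar parameter over draws that stand in for posterior samples, and compares the minimiser to the empirical \(\tau\)-quantile.
# Numerical check: the minimiser of the average quantile loss is the tau-quantile
quantile_loss <- function(t, theta, tau) mean((t - theta) * ((t > theta) - tau))
theta <- rgamma(1e5, shape = 2)   # draws standing in for posterior samples
tau <- 0.95
opt <- optimize(quantile_loss, interval = range(theta), theta = theta, tau = tau)
c(minimiser = opt$minimum, sample_quantile = unname(quantile(theta, tau)))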
We now describe amortised approximate posterior inference through the minimisation of an expected Kullback–Leibler (KL) divergence. Throughout, we let \(q(\boldsymbol{\theta}; \boldsymbol{\kappa})\) denote a parametric approximation to the posterior distribution \(p(\boldsymbol{\theta} \mid \boldsymbol{Z})\), where the approximate-distribution parameters \(\boldsymbol{\kappa}\) belong to a space \(\mathcal{K}\).
We first consider the non-amortised case, where the optimal parameters \(\boldsymbol{\kappa}^*\) for a single data set \(\boldsymbol{Z}\) are found by minimising the KL divergence between \(p(\boldsymbol{\theta} \mid \boldsymbol{Z})\) and \(q(\boldsymbol{\theta}; \boldsymbol{\kappa})\): \[ \boldsymbol{\kappa}^* \equiv \underset{\boldsymbol{\kappa}}{\mathrm{arg\,min\;}} \textrm{KL}\{p(\boldsymbol{\theta} \mid \boldsymbol{Z}) \, \| \, q(\boldsymbol{\theta} ;\boldsymbol{\kappa})\} = \underset{\boldsymbol{\kappa}}{\mathrm{arg\,min\;}} -\int_{\Theta} \log\{ q(\boldsymbol{\theta} ;\boldsymbol{\kappa})\} p(\boldsymbol{\theta} \mid \boldsymbol{Z}) \textrm{d}\boldsymbol{\theta} . \] The resulting approximate posterior \(q(\boldsymbol{\theta}; \boldsymbol{\kappa}^*)\) targets the true posterior in the sense that the KL divergence is zero if and only if \(q(\boldsymbol{\theta}; \boldsymbol{\kappa}^*) = p(\boldsymbol{\theta} \mid \boldsymbol{Z})\) for all \(\boldsymbol{\theta} \in \Theta\). However, solving this optimisation problem is often computationally demanding even for a single data set \(\boldsymbol{Z}\), and solving it for many different data sets can be computationally prohibitive. The optimisation problem can be amortised by treating the parameters \(\boldsymbol{\kappa}\) as a function \(\boldsymbol{\kappa} : \mathcal{Z} \to \mathcal{K}\), and then choosing the function \(\boldsymbol{\kappa}^*(\cdot)\) that minimises an expected KL divergence: \[ \boldsymbol{\kappa}^*(\cdot) \equiv \underset{\boldsymbol{\kappa}(\cdot)}{\mathrm{arg\,min\;}} \mathbb{E}_{\boldsymbol{Z}}[\textrm{KL}\{p(\boldsymbol{\theta} \mid \boldsymbol{Z}) \, \| \, q(\boldsymbol{\theta} ;\boldsymbol{\kappa}(\boldsymbol{Z}))\}]. \] In practice, we approximate \(\boldsymbol{\kappa}^*(\cdot)\) using a neural network, \(\boldsymbol{\kappa}_{\boldsymbol{\gamma}} : \mathcal{Z} \to \mathcal{K}\), which is parameterised by \(\boldsymbol{\gamma}\) and trained by minimising a Monte Carlo approximation of the expected KL divergence above: \[ \boldsymbol{\gamma}^* \equiv \underset{\boldsymbol{\gamma}}{\mathrm{arg\,min}} -\sum_{k=1}^K \log q(\boldsymbol{\theta}^{(k)}; \boldsymbol{\kappa}_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(k)})), \] with \(\boldsymbol{\theta}^{(k)} \sim \pi(\boldsymbol{\theta})\) and \(\boldsymbol{Z}^{(k)} \sim p(\boldsymbol{Z} \mid \boldsymbol{\theta}^{(k)})\), \(k = 1, \dots, K\), as in Section 1.1. Once trained, the neural network \(\boldsymbol{\kappa}_{\boldsymbol{\gamma}^*}(\cdot)\) may be used to estimate the optimal approximate-distribution parameters \(\boldsymbol{\kappa}^*\) given data \(\boldsymbol{Z}\) at almost no computational cost. The neural network \(\boldsymbol{\kappa}_{\boldsymbol{\gamma}^*}(\cdot)\), together with the corresponding approximate distribution \(q(\cdot; \boldsymbol{\kappa})\), is collectively referred to as a neural posterior estimator.
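The following base-R sketch (a toy illustration, not the package's training routine) evaluates this Monte Carlo objective for a conjugate model in which the exact posterior is known, namely \(\theta \sim \rm{Gau}(0, 1)\) and \(Z \mid \theta \sim \rm{Gau}(\theta, 1)\), so that \(\theta \mid Z \sim \rm{Gau}(Z/2, 1/2)\); the mapping that returns the exact posterior parameters attains a lower objective value than a mismatched mapping.
# Monte Carlo objective for amortised posterior approximation (toy conjugate model)
K <- 1e5
theta <- rnorm(K)                 # theta^(k) ~ prior
Z <- rnorm(K, mean = theta)       # Z^(k) | theta^(k), a single datum per parameter vector
# Gaussian approximate posterior q(theta; kappa), with kappa(Z) returning a mean and standard deviation
objective <- function(kappa) {
  k <- kappa(Z)
  -mean(dnorm(theta, mean = k$mean, sd = k$sd, log = TRUE))
}
kappa_exact <- function(Z) list(mean = Z / 2, sd = sqrt(1 / 2))  # exact posterior mapping
kappa_crude <- function(Z) list(mean = Z, sd = 1)                # mismatched mapping
c(exact = objective(kappa_exact), crude = objective(kappa_crude))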
There are numerous options for the approximate distribution \(q(\cdot; \boldsymbol{\kappa})\). For instance, \(q(\cdot;\boldsymbol{\kappa})\) can be modelled as a Gaussian distribution (e.g., Chan et al., 2018; see the Julia documentation for GaussianDistribution), where the parameters \(\boldsymbol{\kappa} = (\boldsymbol{\mu}', \textrm{vech}(\boldsymbol{L})')'\) consist of a \(d\)-dimensional mean parameter \(\boldsymbol{\mu}\) and the \(d(d+1)/2\) non-zero elements of the lower Cholesky factor \(\boldsymbol{L}\) of a covariance matrix, and the half-vectorisation operator \(\textrm{vech}(\cdot)\) vectorises the lower triangle of its matrix argument. One may also consider Gaussian mixtures (e.g., Papamakarios & Murray, 2016) or trans-Gaussian distributions (e.g., Maceda et al., 2024). However, the most widely adopted approach is to model \(q(\cdot\,; \boldsymbol{\kappa})\) using a normalising flow (e.g., Ardizzone et al., 2019; Radev et al., 2022), excellent reviews for which are given by Kobyzev et al. (2020) and Papamakarios (2021). A particularly popular class of normalising flow is the affine coupling flow (e.g., Dinh et al., 2016; Kingma & Dhariwal, 2018; Ardizzone et al., 2019). Since affine coupling flows are universal density approximators (Teshima et al., 2020), they serve as the default and recommended choice for approximate distributions in this package; for further details, see the Julia documentation for NormalisingFlow.
Finally, we describe amortised inference by approximation of the likelihood-to-evidence ratio, \[ r(\boldsymbol{Z}, \boldsymbol{\theta}) \equiv p(\boldsymbol{Z} \mid \boldsymbol{\theta})/p(\boldsymbol{Z}), \] where \(p(\boldsymbol{Z} \mid \boldsymbol{\theta})\) is the likelihood and \(p(\boldsymbol{Z})\) is the marginal likelihood (also known as the model evidence).
The likelihood-to-evidence ratio is ubiquitous in statistical inference. For example, likelihood ratios of the form \(p(\boldsymbol{Z}\mid \boldsymbol{\theta}_0)/p(\boldsymbol{Z}\mid \boldsymbol{\theta}_1)=r(\boldsymbol{Z}, \boldsymbol{\theta}_0)/r(\boldsymbol{Z}, \boldsymbol{\theta}_1)\) are central to hypothesis testing and model comparison, and naturally appear in the transition probabilities of most standard MCMC algorithms used for Bayesian inference. Further, since the likelihood-to-evidence ratio is a prior-free quantity, its approximation facilitates Bayesian inference in applications where one requires multiple fits of the model under different prior distributions.
Unlike the methods discussed earlier, the likelihood-to-evidence ratio might not immediately seem like a quantity well-suited for approximation by neural networks, which are typically trained by minimising empirical risk functions. However, this ratio emerges naturally as a simple transformation of the optimal solution to a standard binary classification problem, derived through the minimisation of an average risk. Specifically, consider a binary classifier \(c(\boldsymbol{Z}, \boldsymbol{\theta})\) that distinguishes dependent data-parameter pairs \({(\boldsymbol{Z}', \boldsymbol{\theta}')' \sim p(\boldsymbol{Z}, \boldsymbol{\theta})}\) with class labels \(Y=1\) from independent data-parameter pairs \({(\tilde{\boldsymbol{Z}}', \tilde{\boldsymbol{\theta}}')' \sim p(\boldsymbol{Z})p(\boldsymbol{\theta})}\) with class labels \(Y=0\), and where the classes are balanced. Here, \(p(\boldsymbol{\theta})\) denotes an arbitrary “proposal” distribution for \(\boldsymbol{\theta}\) that does not, in general, coincide with the prior distribution (see below). Then, the Bayes classifier under binary cross-entropy loss is defined as \[ \begin{aligned} c^*(\cdot, \cdot) &\equiv \underset{c(\cdot, \cdot)}{\mathrm{arg\,min}} \sum_{y\in\{0, 1\}} \textrm{Pr}(Y = y) \int_\Theta\int_\mathcal{Z}L_{\textrm{BCE}}\{y, c(\boldsymbol{Z}, \boldsymbol{\theta})\}p(\boldsymbol{Z}, \boldsymbol{\theta} \mid Y = y)\textrm{d} \boldsymbol{Z} \textrm{d} \boldsymbol{\theta}\\ &= \underset{c(\cdot, \cdot)}{\mathrm{arg\,min}} - \int_\Theta\int_\mathcal{Z}\Big[\log\{c(\boldsymbol{Z}, \boldsymbol{\theta})\}p(\boldsymbol{Z}, \boldsymbol{\theta}) + \log\{1 - c(\boldsymbol{Z}, \boldsymbol{\theta})\}p(\boldsymbol{Z})p(\boldsymbol{\theta}) \Big]\textrm{d} \boldsymbol{Z} \textrm{d} \boldsymbol{\theta}, \end{aligned} \] where \(L_{\textrm{BCE}}(y, c) \equiv -y\log(c) - (1 - y) \log(1 - c)\). It can be shown (e.g., Hermans et al., 2020, App. B) that the Bayes classifier is given by \[ c^*(\boldsymbol{Z}, \boldsymbol{\theta}) = \frac{p(\boldsymbol{Z}, \boldsymbol{\theta})}{p(\boldsymbol{Z}, \boldsymbol{\theta}) + p(\boldsymbol{\theta})p(\boldsymbol{Z})}, \quad \boldsymbol{Z} \in \mathcal{Z}, \boldsymbol{\theta} \in \Theta, \] and, hence, \[ r(\boldsymbol{Z}, \boldsymbol{\theta}) = \frac{c^*(\boldsymbol{Z}, \boldsymbol{\theta})}{1 - c^*(\boldsymbol{Z}, \boldsymbol{\theta})}, \quad \boldsymbol{Z} \in \mathcal{Z}, \boldsymbol{\theta} \in \Theta. \] This connection links the likelihood-to-evidence ratio to the average-risk-optimal solution of a standard binary classification problem, and consequently provides a foundation for approximating the ratio using neural networks. Specifically, let \(c_{\boldsymbol{\gamma}}: \mathcal{Z} \times \Theta \to (0, 1)\) denote a neural network parametrised by \(\boldsymbol{\gamma}\). Then the Bayes classifier may be approximated by \(c_{\boldsymbol{\gamma}^*}(\cdot, \cdot)\), where \[ \boldsymbol{\gamma}^* \equiv \underset{\boldsymbol{\gamma}}{\mathrm{arg\,min}} -\sum_{k=1}^K \Big[\log\{c_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(k)}, \boldsymbol{\theta}^{(k)})\} + \log\{1 - c_{\boldsymbol{\gamma}}(\boldsymbol{Z}^{(\sigma(k))}, \boldsymbol{\theta}^{(k)})\} \Big], \] with each \(\boldsymbol{\theta}^{(k)}\) sampled independently from a “proposal” distribution \(p(\boldsymbol{\theta})\), \(\boldsymbol{Z}^{(k)} \sim p(\boldsymbol{Z} \mid \boldsymbol{\theta}^{(k)})\), and \(\sigma(\cdot)\) a random permutation of \(\{1, \dots, K\}\). 
The proposal distribution \(p(\boldsymbol{\theta})\) does not necessarily correspond to the prior distribution \(\pi(\boldsymbol{\theta})\), which is specified in the downstream inference algorithm (see below). In theory, any \(p(\boldsymbol{\theta})\) with support over \(\Theta\) can be used. However, with finite training data, the choice of \(p(\boldsymbol{\theta})\) is important, as it determines where the parameters \(\{\boldsymbol{\theta}^{(k)}\}\) are most densely sampled and, hence, where the neural network \(c_{\boldsymbol{\gamma}^*}(\cdot, \cdot)\) best approximates the Bayes classifier. Further, since neural networks are only reliable within the support of their training samples, a \(p(\boldsymbol{\theta})\) lacking full support over \(\Theta\) essentially acts as a “soft prior”.
Once the neural network is trained, \(r_{\boldsymbol{\gamma}^*}(\boldsymbol{Z}, \boldsymbol{\theta}) \equiv c_{\boldsymbol{\gamma}^*}(\boldsymbol{Z}, \boldsymbol{\theta})\{1 - c_{\boldsymbol{\gamma}^*}(\boldsymbol{Z}, \boldsymbol{\theta})\}^{-1}\), \(\boldsymbol{Z} \in \mathcal{Z}, \boldsymbol{\theta} \in \Theta\), may be used to quickly approximate the likelihood-to-evidence ratio, and therefore it is called a neural ratio estimator.
Inference based on a neural ratio estimator may proceed in a frequentist setting via maximum likelihood and likelihood ratios (e.g., Walchessen et al., 2024), and in a Bayesian setting by facilitating the computation of transition probabilities in Hamiltonian Monte Carlo and other MCMC algorithms (e.g., Hermans et al., 2020). Further, an approximate posterior distribution can be obtained via the identity \(p(\boldsymbol{\theta} \mid \boldsymbol{Z}) = \pi(\boldsymbol{\theta}) r(\boldsymbol{Z}, \boldsymbol{\theta})\), and sampled from using standard sampling techniques (e.g., Thomas et al., 2022).
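For intuition, the identity \(p(\boldsymbol{\theta} \mid \boldsymbol{Z}) = \pi(\boldsymbol{\theta}) r(\boldsymbol{Z}, \boldsymbol{\theta})\) can be verified in base R for a conjugate toy model in which the ratio is available in closed form (here the exact ratio stands in for a trained neural ratio estimator): \(\theta \sim \rm{Gau}(0, 1)\) and \(Z \mid \theta \sim \rm{Gau}(\theta, 1)\), so that marginally \(Z \sim \rm{Gau}(0, 2)\) and \(\theta \mid Z \sim \rm{Gau}(Z/2, 1/2)\).
# The identity p(theta | Z) = prior(theta) * r(Z, theta), toy conjugate example
Z <- 1.3                                      # a single "observed" datum
theta_grid <- seq(-3, 3, length.out = 5)      # evaluation points for theta
r <- dnorm(Z, mean = theta_grid, sd = 1) / dnorm(Z, mean = 0, sd = sqrt(2))  # likelihood-to-evidence ratio
cbind(
  prior_times_ratio = dnorm(theta_grid) * r,
  exact_posterior   = dnorm(theta_grid, mean = Z / 2, sd = sqrt(1 / 2))
)                                             # the two columns coincide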
The R interface to NeuralEstimators aims to facilitate the use of neural inferential methods for statisticians familiar with R. This is achieved by leveraging the R package JuliaConnectoR (Lenz et al., 2022), which enables seamless communication between R and Julia. The core functions of NeuralEstimators, such as training and assessing the neural networks, are wrapped in R functions to make them accessible (and user-friendly) within the R environment. The only step that requires users to write Julia code is the construction of the neural-network architecture. This design choice ensures maximum flexibility for defining custom architectures.
The typical workflow when using the R interface to NeuralEstimators is as follows, with the steps requiring Julia code explicitly highlighted:
1. Sample parameters from the prior distribution and simulate data from the statistical model, to construct training, validation, and test sets.
2. (Julia) Define the neural-network architecture as a Flux.jl model. The architecture class (e.g., MLP, CNN, GNN) should align with the multivariate structure of the data (e.g., unstructured, grid, graph), and the specific input and output spaces of the neural network depend on the chosen inferential method.
3. Wrap the neural network in the NeuralEstimator corresponding to the intended inferential method: PointEstimator, PosteriorEstimator (see also the available built-in approximate distributions), or RatioEstimator.
4. Train the NeuralEstimator using train() and the training set, monitoring performance and convergence using the validation set.
5. Assess the trained NeuralEstimator using assess() and the test set.
Once the NeuralEstimator has passed our assessments and is deemed to be well calibrated, it may be used to make inference with observed data.
We now consider examples where the data are univariate (Section 2.1), multivariate and unstructured (Section 2.2), collected over a regular grid (Section 2.3), or collected over arbitrary spatial locations (Section 2.4). These examples focus on neural Bayes estimation using a PointEstimator. For a list of other estimator types available in the package (e.g., for neural posterior or ratio estimation), see Estimators and the examples therein.
Before proceeding, we load the required packages: see the README for instructions on installing Julia and the Julia packages NeuralEstimators and Flux. If you have access to an NVIDIA graphics processing unit (GPU), you can utilise it by simply adding CUDA to the list of Julia packages in the final line below:
library("NeuralEstimators")
library("JuliaConnectoR")
library("ggplot2")
library("ggpubr")
library("latex2exp")
juliaEval('using NeuralEstimators, Flux')
#> Starting Julia ...
Here, we develop a neural Bayes estimator for \(\boldsymbol{\theta} \equiv (\mu, \sigma)'\) from data \(\boldsymbol{Z} \equiv (Z_1, \dots, Z_m)'\), where each \(Z_i \overset{\mathrm{iid}}\sim \mathrm{Gau}(\mu, \sigma^2)\).
First, we sample parameters from the prior to construct parameter sets used for training and validating the estimator. Here, we use the priors \(\mu \sim \rm{Gau}(0, 1)\) and \(\sigma \sim \rm{Gamma}(1, 1)\), and we assume that the parameters are independent a priori. The sampled parameters are stored as \(d \times K\) matrices, with \(d\) the number of parameters in the model and \(K\) the number of sampled parameter vectors.
# Sampling from the prior
# K: number of samples to draw from the prior
sampler <- function(K) {
mu <- rnorm(K)
sigma <- rgamma(K, 1)
Theta <- matrix(c(mu, sigma), byrow = TRUE, ncol = K)
return(Theta)
}
K <- 10000
theta_train <- sampler(K)
theta_val <- sampler(K/10)
Next, we define the statistical model implicitly through data simulation. The simulated data are stored as a list, where each element of the list corresponds to a data set simulated conditionally on one parameter vector. The storage type of each data set depends on the multivariate structure of the data (e.g., a matrix for unstructured multivariate data, a multidimensional array for gridded data). In this example, each replicate \(Z_1, \dots, Z_m\) is univariate, so the data are stored as a matrix with \(n = 1\) row and \(m\) columns:
# Marginal simulation from the statistical model
# Theta: a matrix of parameters drawn from the prior
# m: number of conditionally independent replicates for each parameter vector
simulate <- function(Theta, m) {
apply(Theta, 2, function(theta) t(rnorm(m, theta[1], theta[2])), simplify = FALSE)
}
m <- 30
Z_train <- simulate(theta_train, m)
Z_val <- simulate(theta_val, m)
Let’s visualise a few elements of the training set:
mu <- theta_train[1, 1:4]
sigma <- theta_train[2, 1:4]
Z <- Z_train[1:4]
df <- Map(function(z, m, s) {
data.frame(Z = c(z), mu = m, sigma = s)
}, Z, mu, sigma)
df <- do.call(rbind, df)
df$theta <- paste0("$\\mu$ = ", round(df$mu, 3), ", $\\sigma$ = ", round(df$sigma, 3))
df$theta <- as.factor(df$theta)
levels(df$theta) <- sapply(levels(df$theta), TeX)
ggplot(df) +
geom_histogram(aes(x = Z), bins = 30, fill = "blue", alpha = 0.5, color = "black") +
facet_grid(~theta, labeller = label_parsed) +
theme_bw()
We now design our neural network.
As we are constructing a neural Bayes estimator, the neural network is a mapping \(\mathcal{Z}\to\Theta\), and the dimensionality of the neural-network output is therefore \(d \equiv \textrm{dim}(\Theta) = 2\).
Since our data are replicated, we adopt the DeepSets framework, implemented in NeuralEstimators with the type DeepSet. DeepSets consist of two neural networks: an inner network and an outer network. The inner network extracts summary statistics from the data, and its architecture depends on the multivariate structure of the data. For unstructured data (i.e., data without spatial or temporal correlation within each replicate), we use a multilayer perceptron (MLP). The input dimension matches the dimensionality of each data replicate, while the output dimension corresponds to the number of summary statistics appropriate for the model (a common choice is \(d\)). The outer network maps the learned summary statistics to the output space (here, the parameter space, \(\Theta\)). The outer network is always an MLP.
Below is an example of a DeepSets architecture for neural Bayes estimation in this example. Note that many models have parameter constraints (e.g., variance and range parameters that must be strictly positive). These constraints can be incorporated in the final layer of the neural network by choosing appropriate activation functions for each parameter. Here, we enforce the constraint \(\sigma > 0\) by applying the softplus activation function in the final layer of the outer network, ensuring that all parameter estimates are valid. For some additional ways to constrain parameter estimates, see Output layers.
# Initialise the estimator
estimator <- juliaEval('
n = 1 # dimension of each data replicate (univariate)
d = 2 # dimension of the parameter vector θ
w = 128 # width of each hidden layer
# Final layer has output dimension d and enforces parameter constraints
final_layer = Parallel(
vcat,
Dense(w, 1, identity), # mean parameter: no constraints
Dense(w, 1, softplus) # standard-deviation parameter: strictly positive
)
# Construct inner and outer networks and combine into DeepSet
psi = Chain(Dense(n, w, relu), Dense(w, w, relu))
phi = Chain(Dense(w, w, relu), final_layer)
network = DeepSet(psi, phi)
# Wrap the neural network as a PointEstimator
estimator = PointEstimator(network)
')
Next, we train the estimator using train(), here using the default mean-absolute-error loss, so that the resulting neural Bayes estimator approximates the marginal posterior medians. We use the sets of parameters and simulated data constructed earlier; one may alternatively use the arguments sampler and simulator to pass functions for sampling from the prior and simulating from the model, respectively, which facilitates the technique known as “simulation on-the-fly”. These functions can be defined in R (as we have done in this example) or in Julia using the package JuliaConnectoR (see the help file for train()).
# Train the estimator
estimator <- train(
estimator,
theta_train = theta_train,
theta_val = theta_val,
Z_train = Z_train,
Z_val = Z_val
)
#> Computing the initial validation risk... Initial validation risk = 0.7422407
#> Computing the initial training risk... Initial training risk = 0.76490337
#> Epoch: 1 Training risk: 0.245 Validation risk: 0.177 Run time of epoch: 9.313 seconds
#> Epoch: 2 Training risk: 0.174 Validation risk: 0.15 Run time of epoch: 1.203 seconds
#> Epoch: 3 Training risk: 0.158 Validation risk: 0.155 Run time of epoch: 1.019 seconds
#> Epoch: 4 Training risk: 0.154 Validation risk: 0.145 Run time of epoch: 1.25 seconds
#> Epoch: 5 Training risk: 0.149 Validation risk: 0.139 Run time of epoch: 1.242 seconds
#> Epoch: 6 Training risk: 0.145 Validation risk: 0.132 Run time of epoch: 1.101 seconds
#> Epoch: 7 Training risk: 0.143 Validation risk: 0.13 Run time of epoch: 1.114 seconds
#> Epoch: 8 Training risk: 0.143 Validation risk: 0.131 Run time of epoch: 1.04 seconds
#> Epoch: 9 Training risk: 0.143 Validation risk: 0.134 Run time of epoch: 1.078 seconds
#> Epoch: 10 Training risk: 0.141 Validation risk: 0.13 Run time of epoch: 1.035 seconds
#> Epoch: 11 Training risk: 0.139 Validation risk: 0.131 Run time of epoch: 0.926 seconds
#> Epoch: 12 Training risk: 0.136 Validation risk: 0.131 Run time of epoch: 1.248 seconds
#> Epoch: 13 Training risk: 0.139 Validation risk: 0.125 Run time of epoch: 0.79 seconds
#> Epoch: 14 Training risk: 0.136 Validation risk: 0.127 Run time of epoch: 0.959 seconds
#> Epoch: 15 Training risk: 0.136 Validation risk: 0.13 Run time of epoch: 1.173 seconds
#> Epoch: 16 Training risk: 0.136 Validation risk: 0.131 Run time of epoch: 1.014 seconds
#> Epoch: 17 Training risk: 0.135 Validation risk: 0.128 Run time of epoch: 1.005 seconds
#> Epoch: 18 Training risk: 0.135 Validation risk: 0.132 Run time of epoch: 0.874 seconds
#> Epoch: 19 Training risk: 0.135 Validation risk: 0.128 Run time of epoch: 1.153 seconds
#> Stopping early since the validation loss has not improved in 5 epochs
If the argument savepath in train() is specified, the neural-network state (e.g., the weights and biases) will be saved during training as bson files, and the best state (as measured by validation risk) will be saved as best_network.bson. To load these saved states into a neural network in later R sessions, one may use the function loadstate(). Note that one may also manually save a trained estimator using savestate().
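As a rough sketch of how this might look (the exact argument names and signatures of savepath, savestate(), and loadstate() should be checked against the corresponding help files):
# Sketch only: saving and restoring the neural-network state (see help files for exact signatures)
estimator <- train(
  estimator,
  theta_train = theta_train, theta_val = theta_val,
  Z_train = Z_train, Z_val = Z_val,
  savepath = "checkpoints"                    # assumed usage: states saved here during training
)
savestate(estimator, "estimator.bson")        # assumed usage: manually save the trained estimator
estimator <- loadstate(estimator, "checkpoints/best_network.bson")  # assumed usage: restore the best state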
Once a neural Bayes estimator has been trained, its performance should be assessed. Simulation-based (empirical) methods for evaluating the performance of a neural Bayes estimator are ideal, since simulation is already required for constructing the estimator, and because the estimator can be applied to thousands of simulated data sets at almost no computational cost.
To assess the accuracy of the resulting neural Bayes estimator, one may use the function assess(). The resulting object can then be passed to the functions risk(), bias(), and rmse() to compute a Monte Carlo approximation of the Bayes risk, the bias, and the RMSE, or passed to the function plotestimates() to visualise the estimates against the corresponding true values:
# Assess the estimator
theta_test <- sampler(1000)
Z_test <- simulate(theta_test, m)
assessment <- assess(estimator, theta_test, Z_test, estimator_names = "NBE")
#> Running NBE...
plotestimates(assessment, parameter_labels = c("θ1" = expression(mu), "θ2" = expression(sigma)))
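As noted above, the same assessment object can also be passed to risk(), bias(), and rmse() for numerical summaries:
# Numerical summaries of the assessment (Monte Carlo approximations over the test set)
risk(assessment)   # Bayes risk
bias(assessment)   # bias
rmse(assessment)   # root-mean-squared error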
Once the neural Bayes estimator has been trained and assessed, it can be applied to observed data using the function estimate(), and non-parametric bootstrap estimates can be computed using the function bootstrap(). Below, we use simulated data as a surrogate for observed data:
theta <- as.matrix(c(0, 0.5)) # true parameters
Z <- simulate(theta, m) # pretend that this is observed data
estimate(estimator, Z) # point estimate from the "observed data"
#> [,1]
#> [1,] -0.1159449
#> [2,] 0.5819227
bootstrap(estimator, Z)[, 1:6] # non-parametric bootstrap estimates
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] -0.2536258 -0.1886561 -0.1325511 0.0453563 -0.1121531 5.151611e-05
#> [2,] 0.7270321 0.6252608 0.4906906 0.6401244 0.4488662 6.788481e-01
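The bootstrap() output is a matrix with one row per parameter and one column per bootstrap sample, so, for example, pointwise 95% percentile intervals for each parameter can be computed directly from it:
# 95% percentile bootstrap intervals for each parameter
B <- bootstrap(estimator, Z)                     # non-parametric bootstrap estimates
apply(B, 1, quantile, probs = c(0.025, 0.975))   # columns correspond to mu and sigma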
Suppose now that our data consist of \(m\) replicates of an \(n\)-dimensional multivariate distribution.
For unstructured \(n\)-dimensional data, the estimator is based on a multilayer perceptron (MLP), and each data set is stored as a two-dimensional array (a matrix), with the second dimension storing the independent replicates. That is, in this case, we store the data as a list of \(n \times m\) matrices, and the inner network of the DeepSets representation has \(n\) neurons in the input layer.
For example, consider the task of estimating \(\boldsymbol{\theta} \equiv (\mu_1, \mu_2, \sigma, \rho)'\) from data \(\boldsymbol{Z}_1, \dots, \boldsymbol{Z}_m\) that are independent and identically distributed according to the bivariate Gaussian distribution, \[ \rm{Gau}\left(\begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}, \sigma^2\begin{bmatrix} 1 & \rho \\ \rho & 1\end{bmatrix}\right). \] Then, to construct a neural Bayes estimator for this model, one may use the following code for defining a prior, the data simulator, and the neural-network architecture:
# Sampling from the prior
# K: number of samples to draw from the prior
sampler <- function(K) {
mu1 <- rnorm(K)
mu2 <- rnorm(K)
sigma <- rgamma(K, 1)
rho <- runif(K, -1, 1)
theta <- matrix(c(mu1, mu2, sigma, rho), byrow = TRUE, ncol = K)
return(theta)
}
# Marginal simulation from the statistical model
# Theta: a matrix of parameters drawn from the prior
# m: number of conditionally independent replicates for each parameter vector
simulate <- function(Theta, m) {
apply(Theta, 2, function(theta) {
mu <- c(theta[1], theta[2])
sigma <- theta[3]
rho <- theta[4]
Sigma <- sigma^2 * matrix(c(1, rho, rho, 1), 2, 2)
Z <- MASS::mvrnorm(m, mu, Sigma)
t(Z)
}, simplify = FALSE)
}
# Initialise the estimator
estimator <- juliaEval('
n = 2 # dimension of each data replicate
w = 32 # number of neurons in each hidden layer
# Layer to ensure valid estimates
final_layer = Parallel(
vcat,
Dense(w, 2, identity), # mean parameters
Dense(w, 1, softplus), # variance parameter
Dense(w, 1, tanh) # correlation parameter
)
psi = Chain(Dense(n, w, relu), Dense(w, w, relu), Dense(w, w, relu))
phi = Chain(Dense(w, w, relu), Dense(w, w, relu), final_layer)
deepset = DeepSet(psi, phi)
estimator = PointEstimator(deepset)
')
Let’s visualise a few realisations from the current model:
theta <- sampler(3)
Z <- simulate(theta, m = 250)
mu1 <- theta[1, 1:3]
mu2 <- theta[2, 1:3]
sigma <- theta[3, 1:3]
rho <- theta[4, 1:3]
df <- Map(function(z, m1, m2, s, r) {
data.frame(Z1 = z[1, ], Z2 = z[2, ], mu1 = m1, mu2 = m2, sigma = s, rho = r)
}, Z, mu1, mu2, sigma, rho)
df <- do.call(rbind, df)
df$theta <- paste0(
"$\\mu_1$ = ", round(df$mu1, 2), ", ",
"$\\mu_2$ = ", round(df$mu2, 2), ", ",
"$\\sigma$ = ", round(df$sigma, 2), ", ",
"$\\rho$ = ", round(df$rho, 2)
)
df$theta <- as.factor(df$theta)
levels(df$theta) <- sapply(levels(df$theta), TeX)
ggplot(df) +
geom_point(aes(x = Z1, y = Z2), alpha = 0.5) +
facet_grid(~theta, labeller = label_parsed) +
theme_bw()
The training, assessment, and application of the neural Bayes estimator then remains as in the example above:
# Construct training and validation data sets
K <- 5000
m <- 250
theta_train <- sampler(K)
theta_val <- sampler(K/10)
Z_train <- simulate(theta_train, m)
Z_val <- simulate(theta_val, m)
# Train the estimator
estimator <- train(
estimator,
theta_train = theta_train,
theta_val = theta_val,
Z_train = Z_train,
Z_val = Z_val,
epochs = 20
)
#> Computing the initial validation risk... Initial validation risk = 0.7245567
#> Computing the initial training risk... Initial training risk = 0.6967557
#> Epoch: 1 Training risk: 0.396 Validation risk: 0.224 Run time of epoch: 3.916 seconds
#> Epoch: 2 Training risk: 0.197 Validation risk: 0.183 Run time of epoch: 1.719 seconds
#> Epoch: 3 Training risk: 0.173 Validation risk: 0.167 Run time of epoch: 1.773 seconds
#> Epoch: 4 Training risk: 0.161 Validation risk: 0.161 Run time of epoch: 1.73 seconds
#> Epoch: 5 Training risk: 0.151 Validation risk: 0.153 Run time of epoch: 1.619 seconds
#> Epoch: 6 Training risk: 0.146 Validation risk: 0.145 Run time of epoch: 1.735 seconds
#> Epoch: 7 Training risk: 0.141 Validation risk: 0.143 Run time of epoch: 1.744 seconds
#> Epoch: 8 Training risk: 0.136 Validation risk: 0.139 Run time of epoch: 1.686 seconds
#> Epoch: 9 Training risk: 0.136 Validation risk: 0.139 Run time of epoch: 1.546 seconds
#> Epoch: 10 Training risk: 0.132 Validation risk: 0.137 Run time of epoch: 1.715 seconds
#> Epoch: 11 Training risk: 0.131 Validation risk: 0.129 Run time of epoch: 1.678 seconds
#> Epoch: 12 Training risk: 0.127 Validation risk: 0.128 Run time of epoch: 1.669 seconds
#> Epoch: 13 Training risk: 0.126 Validation risk: 0.129 Run time of epoch: 1.719 seconds
#> Epoch: 14 Training risk: 0.126 Validation risk: 0.122 Run time of epoch: 1.917 seconds
#> Epoch: 15 Training risk: 0.123 Validation risk: 0.124 Run time of epoch: 1.885 seconds
#> Epoch: 16 Training risk: 0.123 Validation risk: 0.124 Run time of epoch: 1.839 seconds
#> Epoch: 17 Training risk: 0.119 Validation risk: 0.121 Run time of epoch: 2.005 seconds
#> Epoch: 18 Training risk: 0.122 Validation risk: 0.126 Run time of epoch: 1.882 seconds
#> Epoch: 19 Training risk: 0.117 Validation risk: 0.126 Run time of epoch: 1.796 seconds
#> Epoch: 20 Training risk: 0.118 Validation risk: 0.117 Run time of epoch: 1.791 seconds
# Assess the estimator
theta_test <- sampler(1000)
Z_test <- simulate(theta_test, m)
assessment <- assess(estimator, theta_test, Z_test,
estimator_names = "NBE",
parameter_names = c("mu1", "mu2", "sigma", "rho"))
#> Running NBE...
plotestimates(assessment)
For data collected over a regular grid, neural Bayes estimators are based on a convolutional neural network (CNN). We give specific attention to this case in a separate vignette. There, we also outline how to perform neural Bayes estimation with incomplete/missing data based on methods described by Wang et al. (2024) and Sainsbury-Dale et al. (2025b).
To cater for spatial data collected over arbitrary spatial locations, one may construct a neural estimator with a graph neural network (GNN; see Sainsbury-Dale, Zammit-Mangion, Richards, and Huser, 2025). The overall workflow remains as given in previous examples, with two key additional considerations.
First, if inference is to be made from a single spatial data set collected before constructing the estimator, training data can be simulated using the observed spatial locations, which can be treated as fixed and known. However, if the estimator is intended for application to multiple spatial data sets with varying spatial configurations, it should be trained on a diverse set of spatial configurations. These configurations can be sampled during training, possibly using a spatial point process.
Second, the spatial data should be stored as a graph; a graph can be constructed from the spatial configurations and associated data using the function spatialgraph(), as done in the simulation code below.
For illustration, we develop a neural Bayes estimator for a spatial Gaussian process model, where \(\boldsymbol{Z} \equiv (Z_{1}, \dots, Z_{n})'\) are data collected at locations \(\{\boldsymbol{s}_{1}, \dots, \boldsymbol{s}_{n}\}\) in a spatial domain that is a subset of \(\mathbb{R}^2\). The data are modelled as spatially-correlated mean-zero Gaussian random variables with exponential covariance function, given by \[ \textrm{cov}(Z_i, Z_j) = \textrm{exp}(-\|\boldsymbol{s}_i - \boldsymbol{s}_j\|/\theta), \quad i, j = 1, \dots, n, \] with unknown range parameter \(\theta > 0\). Here, we take the spatial domain to be the unit square, we sample spatial configurations from a Matern cluster process, and we adopt a uniform prior \(\theta \sim \rm{Unif}(0, 0.4)\).
To begin, we define a function for sampling from the prior, and a function for sampling spatial configurations and simulating from the statistical model conditional on the sampled spatial configurations and parameter vectors.
# Sampling from the prior
# K: number of samples to draw from the prior
sampler <- function(K) {
theta <- runif(K, max = 0.4) # draw from the prior
theta <- t(theta) # reshape to matrix
return(theta)
}
# Marginal simulation from the statistical model, split into simulation of the
# spatial configurations S and the spatial data Z
# theta: a matrix of parameters drawn from the prior
# m: number of conditionally independent replicates for each parameter vector
library("parallel") # mclapply()
library("spatstat") # rMatClust()
simulateS <- function(K) {
# Simulate spatial configurations over the unit square, with each configuration
# drawn from a Matern cluster process with different, randomly sampled hyperparameters
n <- sample(100:300, K, replace = TRUE) # approximate sample size
lambda <- runif(K, 10, 50) # intensity of parent Poisson point process
mu <- n/lambda # mean number of daughter points
r <- runif(K, 0.05, 0.2) # cluster radius
S <- lapply(1:K, function(k) {
pts <- rMatClust(lambda[k], r = r[k], mu = mu[k])
cbind(x = pts$x, y = pts$y)
})
return(S)
}
simulateZ <- function(theta, S, m = 1) {
# Simulate conditionally independent replicates for each pair of
# parameters and spatial configurations
Z <- mclapply(1:ncol(theta), function(k) {
D <- as.matrix(dist(S[[k]])) # distance matrix
n <- nrow(D) # sample size
Sigma <- exp(-D/theta[k]) # covariance matrix
L <- t(chol(Sigma)) # Cholesky factor of the covariance matrix
e <- matrix(rnorm(n*m), nrow = n, ncol = m) # standard normal variates
Z <- L %*% e # conditionally independent replicates from the model
Z
})
return(Z)
}
simulate <- function(theta, ...) {
K <- ncol(theta)
S <- simulateS(K)
Z <- simulateZ(theta, S, ...)
G <- spatialgraph(S, Z) # graph from the spatial configurations and associated spatial data
return(G)
}
Let’s visualise a few realisations from our spatial Gaussian process model:
theta <- matrix(c(0.05, 0.1, 0.2, 0.4), nrow = 1)
K <- ncol(theta)
S <- simulateS(K)
Z <- simulateZ(theta, S)
df <- Map(function(z, th, s) {
data.frame(Z = c(z), theta = th, s1 = s[, 1], s2 = s[, 2])
}, Z, theta, S)
df <- do.call(rbind, df)
df$theta <- paste0("$\\theta$ = ", round(df$theta, 2))
df$theta <- as.factor(df$theta)
levels(df$theta) <- sapply(levels(df$theta), TeX)
ggplot(df) +
geom_point(aes(x = s1, y = s2, colour = Z)) +
facet_grid(~theta, labeller = label_parsed) +
scale_colour_viridis_c(option = "magma") +
labs(x = expression(s[1]), y = expression(s[2])) +
scale_x_continuous(breaks = c(0.2, 0.5, 0.8)) +
scale_y_continuous(breaks = c(0.2, 0.5, 0.8)) +
coord_fixed() +
theme_bw()
Next, we define our GNN architecture and initialise the estimator. Here, we use an architecture tailored to isotropic spatial dependence models; for further details, see Section 2.2.1 of Sainsbury-Dale et al. (2025). In this example, our goal is to construct a point estimator; however, other kinds of estimators can be constructed by substituting the appropriate estimator class in the final line below:
# Initialise the estimator
estimator <- juliaEval('
# Load Julia packages for GNN functionality
using GraphNeuralNetworks
using Statistics: mean
# Spatial weight functions: continuous surrogates for 0-1 basis functions
h_max = 0.15 # maximum distance to consider
q = 10 # output dimension of the spatial weights
w = KernelWeights(h_max, q)
# Propagation module
propagation = GNNChain(
SpatialGraphConv(1 => q, relu, w = w, w_out = q),
SpatialGraphConv(q => q, relu, w = w, w_out = q)
)
# Readout module
readout = GlobalPool(mean)
# Inner network
psi = GNNSummary(propagation, readout)
# Expert summary statistics, the empirical variogram
V = NeighbourhoodVariogram(h_max, q)
# Outer network
phi = Chain(
Dense(2q => 128, relu),
Dense(128 => 128, relu),
Dense(128 => 1, softplus) # softplus activation to ensure positive estimates
)
# DeepSet object
deepset = DeepSet(psi, phi; S = V)
# Point estimator
estimator = PointEstimator(deepset)
')
Next, we train the estimator:
# Construct training and validation data sets
K <- 5000
theta_train <- sampler(K)
theta_val <- sampler(K/10)
Z_train <- simulate(theta_train)
Z_val <- simulate(theta_val)
# Train the estimator
estimator <- train(
estimator,
theta_train = theta_train,
theta_val = theta_val,
Z_train = Z_train,
Z_val = Z_val,
epochs = 20
)
#> Computing the initial validation risk... Initial validation risk = 0.43207824
#> Computing the initial training risk... Initial training risk = 0.42689615
#> Epoch: 1 Training risk: 0.046 Validation risk: 0.026 Run time of epoch: 42.692 seconds
#> Epoch: 2 Training risk: 0.027 Validation risk: 0.027 Run time of epoch: 27.13 seconds
#> Epoch: 3 Training risk: 0.026 Validation risk: 0.025 Run time of epoch: 27.295 seconds
#> Epoch: 4 Training risk: 0.024 Validation risk: 0.025 Run time of epoch: 27.866 seconds
#> Epoch: 5 Training risk: 0.024 Validation risk: 0.024 Run time of epoch: 26.801 seconds
#> Epoch: 6 Training risk: 0.023 Validation risk: 0.025 Run time of epoch: 28.124 seconds
#> Epoch: 7 Training risk: 0.023 Validation risk: 0.024 Run time of epoch: 27.827 seconds
#> Epoch: 8 Training risk: 0.022 Validation risk: 0.024 Run time of epoch: 28.377 seconds
#> Epoch: 9 Training risk: 0.023 Validation risk: 0.023 Run time of epoch: 28.272 seconds
#> Epoch: 10 Training risk: 0.022 Validation risk: 0.024 Run time of epoch: 27.355 seconds
#> Epoch: 11 Training risk: 0.022 Validation risk: 0.023 Run time of epoch: 27.683 seconds
#> Epoch: 12 Training risk: 0.022 Validation risk: 0.025 Run time of epoch: 27.456 seconds
#> Epoch: 13 Training risk: 0.022 Validation risk: 0.023 Run time of epoch: 27.191 seconds
#> Epoch: 14 Training risk: 0.022 Validation risk: 0.023 Run time of epoch: 26.82 seconds
#> Epoch: 15 Training risk: 0.022 Validation risk: 0.023 Run time of epoch: 27.02 seconds
#> Stopping early since the validation loss has not improved in 5 epochs
Note that the computations in GNNs are performed in parallel, making them particularly well-suited for GPUs, which typically contain thousands of cores. If you have access to an NVIDIA GPU, you can utilise it by simply loading the Julia package CUDA, for example, by running juliaEval('using CUDA').
Next, we assess the trained estimator:
# Assess the estimator
theta_test <- sampler(1000)
Z_test <- simulate(theta_test)
assessment <- assess(estimator, theta_test, Z_test, estimator_names = "GNN NBE")
#> Running GNN NBE...
plotestimates(assessment)
Finally, once the estimator has been assessed, it may be applied to observed data: