Abstract
We describe the syntax of the widely popular numDeriv
package and show how the functions from the pnd package
recognise and handle the parameters related to numerical difference
computation. We draw parallels between the two, examine the differences,
and provide recommendations for new users.
Load the packages first:
library(pnd)
#> Parallel numerical derivatives v. 0.0.6 (2025-01-23).
#> 4 physical cores for parallelism through mclapply forking are available on Linux.
library(numDeriv)
Numerical derivatives are ubiquitous in applied statistics and
numerical methods, which is why many packages rely on streamlined
implementations of numerical derivatives as dependencies. The popular
numDeriv package by Paul Gilbert and Ravi Varadhan has 314
reverse dependencies as of October 2024. Its flexibility has allowed other
packages to achieve better results by providing reasonable defaults and
an interface for the fine-tuning that matters in numerical optimisation and
statistical inference. However, it has no parallel capabilities, which
is why calculating numerical gradients with its grad function can take
a lot of time.
This vignette showcases the similarities and differences between the
features of numDeriv and pnd for a smooth and
painless transition to the new package.
numDeriv::grad
The implementation of numDeriv::grad() is clever: it
allows both vectorised and non-vectorised functions to be provided as
the input argument. Since gradients are defined for scalar functions, it
has to check whether the function of interest is scalar- or
vector-valued. This can be non-trivial with vectorised functions: although
\(\sin x\) is a scalar function,
sin(1:100) will return a vector in R. Therefore,
numDeriv::grad() has to determine whether the
implementation of \(f\) is
\(\mathbb{R}^n \mapsto \mathbb{R}^n\)
(scalar / vectorised scalar), \(\mathbb{R}^n \mapsto \mathbb{R}\) (scalar
multivariate function), or neither. Hence, functions like \(\mathbb{R} \mapsto \mathbb{R}^n\) or \(\mathbb{R}^n \mapsto \mathbb{R}^m\) are not
considered valid inputs.
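As a quick illustration (a minimal example of ours, not part of the official documentation), both of the following shapes are accepted; the values in the comments are approximate:
f.vec <- sin                        # vectorised scalar: R^n -> R^n, applied element-wise
f.multi <- function(x) sum(x^2)     # scalar multivariate: R^n -> R
numDeriv::grad(f.vec, 1:3)          # approximately cos(1:3)
numDeriv::grad(f.multi, 1:3)        # approximately c(2, 4, 6)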
Internally, numDeriv::grad(f, x0, ...) computes
f0 = f(x0, ...) and checks the length of f0.
For the input argument x0 with length(x0) = n,
the allowed lengths of f0 are either 1 (scalar
output) or n (vectorised output). Expressions such as
grad(function(x) c(sin(x), cos(x)), x = 1) return an error,
implying that for vector-valued functions, the user might want to compute
a Jacobian. Similarly, numDeriv::hessian() checks the
dimensions of f0 = f(x0, ...) and only allows
non-vectorised functions with length(f0) == 1. Finally, the
function numDeriv::genD computes the first- and
second-derivative information (without cross-derivatives).
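A rough sketch of this dimensionality check (a simplification for exposition, not the actual numDeriv source code) could look as follows:
classifyFun <- function(f, x0, ...) {
  f0 <- f(x0, ...)   # the initial evaluation used for the check
  n  <- length(x0)
  if (length(f0) == 1L) return("scalar output: gradients and Hessians are allowed")
  if (length(f0) == n)  return("vectorised scalar output: element-wise gradients are allowed")
  stop("vector-valued function: use jacobian() instead of grad()")
}
classifyFun(function(x) sum(sin(x)), 1:4)   # scalar output
classifyFun(sin, 1:4)                       # vectorised scalar output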
This check is not without its drawbacks: it may miscalculate function
dimensions. The implementation in pnd::Grad() handles the
edge cases that cause wrong outputs in numDeriv::grad().
The pnd package provides similar functions:
Grad() is a drop-in replacement for numDeriv::grad(),
Hessian() replaces numDeriv::hessian(),
Jacobian() subsumes numDeriv::jacobian(),
and GenD(), whilst not corresponding to numDeriv::genD(), is the real workhorse
that computes arbitrary derivatives of arbitrary accuracy order.
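As a quick sanity check of the drop-in claim (an example of ours), the two packages should produce numerically close gradients for a well-behaved scalar function:
f.test <- function(x) sum(sin(x))
g.numDeriv <- numDeriv::grad(f.test, x = 1:3)
g.pnd      <- pnd::Grad(f.test, x = 1:3)
# Both should be close to cos(1:3); the attributes differ, hence check.attributes = FALSE
all.equal(g.numDeriv, g.pnd, check.attributes = FALSE)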
Example 1: correct dimensionality check results for vector-valued functions.
f2 <- function(x) c(sin(x), cos(x)) # Vector output -> gradient is unsupported
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
hessian(f2, x = 1:4)
#> Error in hessian.default(f2, x = 1:4): Richardson method for hessian assumes a scalar valued function.
Grad(f2, x = 1:4)
#> Error in Grad(f2, x = 1:4): Use 'Jacobian()' instead of 'Grad()' for vector-valued functions to obtain a matrix of derivatives.
This check correctly identifies non-vectorised functions as well:
f2 <- function(x) c(sum(sin(x)), sum(cos(x)))
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
hessian(f2, x = 1:4)
#> Error in hessian.default(f2, x = 1:4): Richardson method for hessian assumes a scalar valued function.
jacobian(f2, x = 1:4)
#> [,1] [,2] [,3] [,4]
#> [1,] 0.5403023 -0.4161468 -0.9899925 -0.6536436
#> [2,] -0.8414710 -0.9092974 -0.1411200 0.7568025
Grad(f2, x = 1:4)
#> Error in Grad(f2, x = 1:4): Use 'Jacobian()' instead of 'Grad()' for vector-valued functions to obtain a matrix of derivatives.
Jacobian(f2, x = 1:4)
#> [,1] [,2] [,3] [,4]
#> [1,] 0.5403023 -0.4161468 -0.9899925 -0.6536436
#> [2,] -0.8414710 -0.9092974 -0.1411200 0.7568025
#> attr(,"step.size")
#> [1] 6.055454e-06 1.211091e-05 1.816636e-05 2.422182e-05
#> attr(,"step.size.method")
#> [1] "default"
Example 2: valid input, invalid output in numDeriv.
If a function consists of individually vectorised components and
returns an output that differs in length from the input, then
numDeriv::jacobian may return wrong results. Specifically,
the output should be stacked coordinate-wise gradients, but it is made of
stacked diagonal matrices instead.
f2 <- function(x) c(sin(x), cos(x))
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
jacobian(f2, x = 1:4)
#> [,1] [,2] [,3] [,4]
#> [1,] 0.5403023 0.0000000 0.0000000 0.0000000
#> [2,] 0.0000000 -0.4161468 0.0000000 0.0000000
#> [3,] 0.0000000 0.0000000 -0.9899925 0.0000000
#> [4,] 0.0000000 0.0000000 0.0000000 -0.6536436
#> [5,] -0.8414710 0.0000000 0.0000000 0.0000000
#> [6,] 0.0000000 -0.9092974 0.0000000 0.0000000
#> [7,] 0.0000000 0.0000000 -0.1411200 0.0000000
#> [8,] 0.0000000 0.0000000 0.0000000 0.7568025
There are 3 methods implemented in numDeriv
for gradient
calculation: ‘simple’, ‘Richardson’, and ‘complex’. For simplicity, the
formulæ given here assume scalar arguments; for vector arguments, these
calculations are done coordinate by coordinate.
method = "simple"
, numDeriv::grad()
computes a fast one-sided difference with a fixed step size (the default
is 0.0001): \(f'_{\mathrm{simple}} :=
\frac{f(x + h) - f(x)}{h}\) or \(\frac{f(x) - f(x-h)}{h}\).method = "Richardson"
, calls a routine for Richardson
extrapolation (described above).method = "complex"
is valid only for those functions
for which the function may take complex inputs and handle complex
outputs. This limits the scope of this method to well-defined convenient
mathematical functions; this excludes the majority of applications in
statistics, biometrics, economics, and data science. However, if there
is such a function, then,
numDeriv::grad(..., method = "complex")
will return \(\Im[f(x + \mathrm{i} h)]/h\).This syntax of numDeriv::grad
makes consistent step size
selection a chore because of the differences in method implementations.
If method = "simple"
, the step size is
method.args$eps
(defaults to \(10^{-4}\)), and it is absolute. If
method = "Richardson"
, the step size is the initial step
size in the Richardson extrapolation, and is computed as follows:
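To see why this expression is confusing, one can evaluate it directly for several inputs; here, we plug in the default values as documented in ?numDeriv::grad (d = 1e-4, eps = 1e-4, zero.tol = sqrt(.Machine$double.eps/7e-7), roughly 1.8e-5) – treat these values as an illustration of the formula and consult the manual for the authoritative definitions:
d <- 1e-4; eps <- 1e-4; zero.tol <- sqrt(.Machine$double.eps / 7e-7)
x <- c(0, 1e-6, 0.01, 1, 1000)
d*x + (abs(x) < zero.tol) * eps   # initial Richardson step for each x
# x = 0 and x = 1e-6 receive an absolute step of about 1e-4,
# whilst x = 0.01 receives a much smaller relative step of 1e-6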
Technical remarks.
Why using d*x + (abs(x)<zero.tol) * eps is confusing
To dissect the Richardson extrapolation practically, it suffices to make the function print its input arguments.
f <- function(x) {print(x); sin(x)}
x0 <- 1
g1 <- numDeriv::grad(f, x0)
#> [1] 1
#> [1] 1.0001
#> [1] 0.9999
#> [1] 1.00005
#> [1] 0.99995
#> [1] 1.000025
#> [1] 0.999975
#> [1] 1.000012
#> [1] 0.9999875
print(g1)
#> [1] 0.5403023
cat("Auto-detected step:", step.SW(sin, x0)$par, "\n")
#> Auto-detected step: 5e-06
hgrid <- 10^seq(-10, -4, 1/32)
errors <- sapply(hgrid, function(h) Grad(sin, x0, h = h)) - cos(x0)
plot(hgrid, abs(errors), log = "xy", cex = 0.6)
Even with 4 Richardson extrapolations from 8 evaluations, the default
final step size is 1.25e-5, which is larger than the
optimal step size (less than 1e-5, as visible from
the plot), larger than the empirically determined step size 5e-6, and
larger than the rule-of-thumb value MachEps^(1/3).
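For reference, this rule-of-thumb value can be computed directly:
.Machine$double.eps^(1/3)   # rule-of-thumb step for 2nd-order central differences, ~6.1e-6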
The equivalence between numDeriv and pnd
implementations of the gradient can be established by computing the
optimal weights in the finite difference for the points in the
iterations of the Richardson extrapolation. In this example, the default
sequence of steps is \(\{h_i\}_{i=1}^4 =
10^{-4} / \{1, 2, 4, 8\}\), and the resulting stencil is \(\{\pm h_i\}_{i=1}^4\). The respective
weights can be calculated via fdCoef():
b <- fdCoef(stencil = c(-(2^(3:0)), 2^(0:3)))
print(b)
#> $stencil
#> [1] -8 -4 -2 -1 1 2 4 8
#>
#> $weights
#> x-8h x-4h x-2h x-1h x+1h
#> 2.204586e-05 -3.703704e-03 1.185185e-01 -7.223986e-01 7.223986e-01
#> x+2h x+4h x+8h
#> -1.185185e-01 3.703704e-03 -2.204586e-05
#>
#> attr(,"remainder.coef")
#> [1] -0.01128748
#> attr(,"accuracy.order")
#> requested effective
#> NA 8
#> attr(,"expansion")
#> [1] "f' - 1.1287e-02 f^(9) + ..."
Here, the weights on the outermost points – the first step of the
iteration with \(h_1\) – are minuscule
(2.2e-5). The relative importance of each iteration for
this specific function at this specific point can be found from the
equation \[
f'_{\mathrm{Rich,8}} =
\frac{f(x-h_1)-168f(x-h_2)+5376f(x-h_3)-32768f(x-h_4)+32768f(x+h_4)-5376f(x+h_3)+168f(x+h_2)-f(x+h_1)}{45360\, h_4}
= \sum_{i=1}^4 w_{i}[f(x + h_i) - f(x - h_i)],
\] where \(\{w_i\}_{i=1}^4 \approx
\{0.608, -0.0997, +0.0031, -0.00002\}\). This result can be
calculated numerically:
fd <- sin(x0 + b$stencil / 8e4) * b$weights
abs(fd[1:4]) / sum(abs(fd[1:4]))
#> x-8h x-4h x-2h x-1h
#> 2.609937e-05 4.384834e-03 1.403170e-01 8.552721e-01
The absolute contribution of each term is proportional to specific terms of a certain polynomial. Practically, it implies that the weights of the summation terms further away from the point of interest decay exponentially (up to a certain constant), and similar accuracy can be achieved with fewer evaluations. In this example, 85% of the sum is defined by the finite difference with \(h_4\), and 14% by the one with \(h_3\). Therefore, the following call is twice as cheap, and the discrepancy between its result and the full Richardson-extrapolated value is negligible:
g2 <- Grad(f, x0, h = 1.25e-05, acc.order = 4, vectorised = TRUE, report = 0)
#> [1] 1
#> [1] 0.9999750 0.9999875
#> [1] 0.999975
#> [1] 0.9999875
#> [1] 1.000012
#> [1] 1.000025
print(g2)
#> [1] 0.5403023
c(diff = g1 - g2, Error8 = cos(x0) - g1, Error4 = cos(x0) - g2)
#> diff Error8 Error4
#> -2.377543e-12 4.636624e-12 2.259082e-12
In this example, the approximation errors from the Richardson extrapolation and a much cheaper weighted sum have the same order of magnitude.
Conclusion: choosing a large initial step and subsequently shrinking it is wasteful because similar, if not indistinguishable, accuracy can be achieved with half as many evaluations and a reasonably chosen bandwidth. If the function is to be evaluated many times, selecting the optimal step size via a data-driven procedure not only saves time whilst attaining comparable accuracy but also provides opportunities for speed-up via parallel evaluation of the function on the grid. Richardson extrapolation, on the other hand, is typically computed in a loop, and since a fully parallelised implementation of it is subsumed by the weighted-sum approach, the new package does not dedicate any special numerical routine to this particular case.
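As an illustration of such a data-driven procedure (a minimal sketch reusing the step.SW() selector shown earlier), the step size can be chosen once and then reused in subsequent cheap gradient calls:
h.opt <- step.SW(sin, 1)$par   # one-off automated step-size search, as in the earlier example
Grad(sin, 1, h = h.opt)        # fast 2nd-order central difference with the chosen step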
The term ‘extrapolation group’ can be found, e.g., in Lindfield and Penny (1989), where it is used in
sections on numerical integration and differentiation and describes
sub-intervals on which polynomials of different degrees are fitted to
improve approximation accuracy. Computer programmes in the 1980s were
often optimised for memory use and size, not necessarily for ease of
interpretation or transparency of their correspondence to the
theoretical relationships that they were translating into numbers.
Therefore, a reader might get confused by the terms ‘extrapolation
group’, ‘improvement iteration’, and ‘accuracy order’. The
pnd
package aims to provide more straightforward diagnostic
information.
Suppose that one wants to compute the derivative of \(f(x) = x^9\) at \(x_0 = 1\) – therefore, the true derivative
value is \(f'(1) = 9\). The
following code effectively debugs numDeriv::grad() and
shows how many values are computed (we choose the initial step size
\(h=2^{-10} = 1/1024\) to avoid any
representation error of \(x_0+h\)):
f <- function(x) {print(x, digits = 16); x^9}
fp1 <- numDeriv::grad(f, x = 1, method.args = list(r = 4, d = 2^-10, show.details = TRUE))
#> [1] 1
#> [1] 1.0009765625
#> [1] 0.9990234375
#> [1] 1.00048828125
#> [1] 0.99951171875
#> [1] 1.000244140625
#> [1] 0.999755859375
#> [1] 1.0001220703125
#> [1] 0.9998779296875
#>
#> first order approximations
#> [,1]
#> [1,] 9.00008010876
#> [2,] 9.00002002717
#> [3,] 9.00000500679
#> [4,] 9.00000125170
#>
#> Richarson improvement group No. 1
#> [,1]
#> [1,] 8.99999999997
#> [2,] 9.00000000000
#> [3,] 9.00000000000
#>
#> Richarson improvement group No. 2
#> [,1]
#> [1,] 9
#> [2,] 9
print(fp1, digits = 16)
#> [1] 9.000000000000064
In total, the function was called 9 times: 1 for the initial
dimensionality check and 2 times per each of the 4 iterations (since r = 4).
In this implementation, the terminology differs from the textbook that
it is referring to: numDeriv’s ‘first order approximations’
correspond to Lindfield & Penny’s ‘Group 1’, its ‘Richarson improvement
group No. 1’ to LP’s ‘Group 2’, and the final output returned by
grad() would be ‘Group 4’.
In central differences, \(f(x_0)\)
is not used; with 8 symmetric evaluations, the truncation error of the
result is \(O(h^8)\).
Therefore, the standard textbook results about the optimal step
size for central first differences being proportional to \(\epsilon_{\mathrm{mach}}^{1/3}\)
are not applicable to numDeriv::grad():
with the default accuracy order of 8, a step size of the
order \(\epsilon_{\mathrm{mach}}^{1/9}\), not \(\epsilon_{\mathrm{mach}}^{1/3}\), is
appropriate in this case.
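The difference in scale is easy to see (ignoring the problem-dependent constants in the optimal-step formulæ):
c(order2 = .Machine$double.eps^(1/3),   # ~6.1e-6, appropriate for accuracy order 2
  order8 = .Machine$double.eps^(1/9))   # ~0.018, appropriate for accuracy order 8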
With pnd, there is no need to re-define the function to
save its computed values because both the grid and the function values
are returned as attributes of the result.
# f <- function(x) x^9
# fp2 <- pnd::Grad(f, x = 1, h = "SW", acc.order = 8, vectorised1d = TRUE, report = 2)
# print(attributes(fp2)$step.search$iterations, digits = 16)
This highlights an important difference between the two packages.
numDeriv::grad() starts at a relatively large step size
\(h = 10^{-4} \cdot \max(|x_0|, 1)\)
and repeatedly shrinks it, yielding an exponentially decreasing sequence
\(h, h / v, h / v^2, \ldots\) with the
same reduction factor \(v\) to attain
the desired accuracy order (\(a = 2r = 8\) with the default number of
iterations \(r = 4\)) – slow due to many evaluations, but rather accurate.
pnd::Grad() starts at the rule-of-thumb step size of
the correct order \(\epsilon_{\text{mach}}^{1/(1+a)}\). By
default, it computes central derivatives \(f'_{\mathrm{CD}}(x, h)\) with accuracy
order \(a = 2\) – fast due to
fewer evaluations, with a smaller default step size that agrees with the
numerical-analysis literature, and with estimates of the approximation error
provided. If an accuracy order \(a > 2\) is requested, the evaluation grid
is linear: \(x \pm h, x \pm 2h, x \pm 3h, \ldots\)
Finally, the method argument show.details = TRUE is
ignored for Hessians in numDeriv:
g <- function(x) sum(sin(x))
hessian(g, 1:3, method.args = list(show.details = TRUE))
#> [,1] [,2] [,3]
#> [1,] -8.414710e-01 4.072455e-14 -2.864747e-13
#> [2,] 4.072455e-14 -9.092974e-01 1.358524e-14
#> [3,] -2.864747e-13 1.358524e-14 -1.411200e-01
This is due to the fact that the hessian.default()
method calls genD(), but the latter never checks if
method.args$show.details is TRUE in the loops
where the extrapolation is carried out. This silent operation mode was
probably implemented to avoid output verbosity since there are many
elements in matrices. Nevertheless, any user who has not looked at the
source code of hessian() would be puzzled by this
unexpected behaviour because the manual of ?hessian
explicitly refers to ?grad and says that
method.args is passed to grad.
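For comparison (our own check), the same flag should produce verbose output when passed to grad() directly:
numDeriv::grad(g, 1:3, method.args = list(show.details = TRUE))
# should print the ‘first order approximations’ and the ‘Richarson improvement group’
# tables before returning the gradient, approximately cos(1:3)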