Abstract
We describe the syntax of the widely popular numDeriv
package and show how the functions from the pnd package
recognise and handle the parameters related to numerical difference
computation. We draw parallels between the two, examine the differences,
and provide recommendations for new users.
Load the packages first:
library(pnd)
#> Parallel numerical derivatives v. 0.0.6 (2025-01-23).
#> 4 physical cores for parallelism through mclapply forking are available on Linux.
library(numDeriv)
Numerical derivatives are ubiquitous in applied statistics and
numerical methods, which is why many packages rely on streamlined
implementations of numerical derivatives as dependencies. The popular
numDeriv package by Paul Gilbert and Ravi Varadhan has 314
reverse dependencies as of October 2024. Its flexibility has allowed other
packages to achieve better results by providing reasonable defaults and
an interface for the fine-tuning that matters in numerical optimisation and
statistical inference. However, it has no parallel capabilities, which
is why calculating numerical gradients with its grad function can take
a lot of time.
This vignette showcases the similarities and differences between the
features of numDeriv and pnd for a smooth and
painless transition to the new package.
numDeriv::grad
The implementation of numDeriv::grad() is clever: it
allows both vectorised and non-vectorised functions to be provided as
the input argument. Since gradients are defined for scalar functions, it
has to check whether the function of interest is scalar- or
vector-valued. This can be non-trivial with vectorised functions: although
\(\sin x\) is a scalar function,
sin(1:100) will return a vector in R. Therefore,
numDeriv::grad() has to determine whether the
implementation of \(f\) is
\(\mathbb{R}^n \mapsto \mathbb{R}^n\)
(scalar / vectorised scalar), \(\mathbb{R}^n \mapsto \mathbb{R}\) (scalar
multivariate function), or neither. Hence, functions like \(\mathbb{R} \mapsto \mathbb{R}^n\) or \(\mathbb{R}^n \mapsto \mathbb{R}^m\) are not
considered valid inputs.
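As a quick illustration (a minimal example of ours, not part of the official documentation), both of the following shapes are accepted; the values in the comments are approximate:
f.vec <- sin                        # vectorised scalar: R^n -> R^n, applied element-wise
f.multi <- function(x) sum(x^2)     # scalar multivariate: R^n -> R
numDeriv::grad(f.vec, 1:3)          # approximately cos(1:3)
numDeriv::grad(f.multi, 1:3)        # approximately c(2, 4, 6)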
Internally, numDeriv::grad(f, x0, ...) computes
f0 = f(x0, ...) and checks the length of f0.
For the input argument x0 with length(x0) = n,
the allowed lengths of f0 are either 1 (scalar
output) or n (vectorised output). Expressions such as
grad(function(x) c(sin(x), cos(x)), x = 1) return an error,
implying that for vector-valued functions, the user might want to compute
a Jacobian. Similarly, numDeriv::hessian() checks the
dimensions of f0 = f(x0, ...) and only allows
non-vectorised functions with length(f0) == 1. Finally, the
function numDeriv::genD computes the first- and
second-derivative information (without cross-derivatives).
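A rough sketch of this dimensionality check (a simplification for exposition, not the actual numDeriv source code) could look as follows:
classifyFun <- function(f, x0, ...) {
  f0 <- f(x0, ...)   # the initial evaluation used for the check
  n  <- length(x0)
  if (length(f0) == 1L) return("scalar output: gradients and Hessians are allowed")
  if (length(f0) == n)  return("vectorised scalar output: element-wise gradients are allowed")
  stop("vector-valued function: use jacobian() instead of grad()")
}
classifyFun(function(x) sum(sin(x)), 1:4)   # scalar output
classifyFun(sin, 1:4)                       # vectorised scalar output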
This check is not without its drawbacks: it may miscalculate function
dimensions. The implementation in pnd::Grad() handles the
edge cases that cause wrong outputs in numDeriv::grad().
The pnd package provides similar functions:
Grad() is a drop-in replacement for numDeriv::grad(),
Hessian() replaces numDeriv::hessian(),
Jacobian() subsumes numDeriv::jacobian(),
and GenD(), whilst not corresponding to numDeriv::genD(), is the real workhorse
that computes arbitrary derivatives of arbitrary accuracy order.
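As a quick sanity check of the drop-in claim (an example of ours), the two packages should produce numerically close gradients for a well-behaved scalar function:
f.test <- function(x) sum(sin(x))
g.numDeriv <- numDeriv::grad(f.test, x = 1:3)
g.pnd      <- pnd::Grad(f.test, x = 1:3)
# Both should be close to cos(1:3); the attributes differ, hence check.attributes = FALSE
all.equal(g.numDeriv, g.pnd, check.attributes = FALSE)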
Example 1: correct dimensionality check results for vector-valued functions.
f2 <- function(x) c(sin(x), cos(x)) # Vector output -> gradient is unsupported
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
hessian(f2, x = 1:4)
#> Error in hessian.default(f2, x = 1:4): Richardson method for hessian assumes a scalar valued function.
Grad(f2, x = 1:4)
#> Error in Grad(f2, x = 1:4): Use 'Jacobian()' instead of 'Grad()' for vector-valued functions to obtain a matrix of derivatives.
This check correctly identifies non-vectorised functions as well:
f2 <- function(x) c(sum(sin(x)), sum(cos(x)))
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
hessian(f2, x = 1:4)
#> Error in hessian.default(f2, x = 1:4): Richardson method for hessian assumes a scalar valued function.
jacobian(f2, x = 1:4)
#> [,1] [,2] [,3] [,4]
#> [1,] 0.5403023 -0.4161468 -0.9899925 -0.6536436
#> [2,] -0.8414710 -0.9092974 -0.1411200 0.7568025
Grad(f2, x = 1:4)
#> Error in Grad(f2, x = 1:4): Use 'Jacobian()' instead of 'Grad()' for vector-valued functions to obtain a matrix of derivatives.
Jacobian(f2, x = 1:4)
#> [,1] [,2] [,3] [,4]
#> [1,] 0.5403023 -0.4161468 -0.9899925 -0.6536436
#> [2,] -0.8414710 -0.9092974 -0.1411200 0.7568025
#> attr(,"step.size")
#> [1] 6.055454e-06 1.211091e-05 1.816636e-05 2.422182e-05
#> attr(,"step.size.method")
#> [1] "default"
Example 2: valid input, invalid output in numDeriv.
If a function consists of individually vectorised components and
returns an output that differs in length from the input, then
numDeriv::jacobian may return wrong results. Specifically,
the output should be stacked coordinate-wise gradients, but it is made of
stacked diagonal matrices instead.
f2 <- function(x) c(sin(x), cos(x))
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
jacobian(f2, x = 1:4)
#> [,1] [,2] [,3] [,4]
#> [1,] 0.5403023 0.0000000 0.0000000 0.0000000
#> [2,] 0.0000000 -0.4161468 0.0000000 0.0000000
#> [3,] 0.0000000 0.0000000 -0.9899925 0.0000000
#> [4,] 0.0000000 0.0000000 0.0000000 -0.6536436
#> [5,] -0.8414710 0.0000000 0.0000000 0.0000000
#> [6,] 0.0000000 -0.9092974 0.0000000 0.0000000
#> [7,] 0.0000000 0.0000000 -0.1411200 0.0000000
#> [8,] 0.0000000 0.0000000 0.0000000 0.7568025
There are 3 methods implemented in numDeriv
for gradient
calculation: ‘simple’, ‘Richardson’, and ‘complex’. For simplicity, the
formulæ given here assume scalar arguments; for vector arguments, these
calculations are done coordinate by coordinate.
method = "simple"
, numDeriv::grad()
computes a fast one-sided difference with a fixed step size (the default
is 0.0001): \(f'_{\mathrm{simple}} :=
\frac{f(x + h) - f(x)}{h}\) or \(\frac{f(x) - f(x-h)}{h}\).method = "Richardson"
, calls a routine for Richardson
extrapolation (described above).method = "complex"
is valid only for those functions
for which the function may take complex inputs and handle complex
outputs. This limits the scope of this method to well-defined convenient
mathematical functions; this excludes the majority of applications in
statistics, biometrics, economics, and data science. However, if there
is such a function, then,
numDeriv::grad(..., method = "complex")
will return \(\Im[f(x + \mathrm{i} h)]/h\).This syntax of numDeriv::grad
makes consistent step size
selection a chore because of the differences in method implementations.
If method = "simple"
, the step size is
method.args$eps
(defaults to \(10^{-4}\)), and it is absolute. If
method = "Richardson"
, the step size is the initial step
size in the Richardson extrapolation, and is computed as follows:
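To see why this expression is confusing, one can evaluate it directly for several inputs; here, we plug in the default values as documented in ?numDeriv::grad (d = 1e-4, eps = 1e-4, zero.tol = sqrt(.Machine$double.eps/7e-7), roughly 1.8e-5) – treat these values as an illustration of the formula and consult the manual for the authoritative definitions:
d <- 1e-4; eps <- 1e-4; zero.tol <- sqrt(.Machine$double.eps / 7e-7)
x <- c(0, 1e-6, 0.01, 1, 1000)
d*x + (abs(x) < zero.tol) * eps   # initial Richardson step for each x
# x = 0 and x = 1e-6 receive an absolute step of about 1e-4,
# whilst x = 0.01 receives a much smaller relative step of 1e-6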
Technical remarks.
Why using d*x + (abs(x)<zero.tol) * eps is confusing
To dissect the Richardson extrapolation practically, it suffices to make the function print its input arguments.
f <- function(x) {print(x); sin(x)}
x0 <- 1
g1 <- numDeriv::grad(f, x0)
#> [1] 1
#> [1] 1.0001
#> [1] 0.9999
#> [1] 1.00005
#> [1] 0.99995
#> [1] 1.000025
#> [1] 0.999975
#> [1] 1.000012
#> [1] 0.9999875
print(g1)
#> [1] 0.5403023
cat("Auto-detected step:", step.SW(sin, x0)$par, "\n")
#> Auto-detected step: 5e-06
hgrid <- 10^seq(-10, -4, 1/32)
errors <- sapply(hgrid, function(h) Grad(sin, x0, h = h)) - cos(x0)
plot(hgrid, abs(errors), log = "xy", cex = 0.6)
Even with 4 Richardson extrapolations from 8 evaluations, the default
final step size is 1.25e-5, which is larger than the
optimal step size (less than 1e-5, as visible from
the plot), larger than the empirically determined step size 5e-6, and
larger than the rule-of-thumb value MachEps^(1/3).
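For reference, this rule-of-thumb value can be computed directly:
.Machine$double.eps^(1/3)   # rule-of-thumb step for 2nd-order central differences, ~6.1e-6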
The equivalence between numDeriv and pnd
implementations of the gradient can be established by computing the
optimal weights in the finite difference for the points in the
iterations of the Richardson extrapolation. In this example, the default
sequence of steps is \(\{h_i\}_{i=1}^4 =
10^{-4} / \{1, 2, 4, 8\}\), and the resulting stencil is \(\{\pm h_i\}_{i=1}^4\). The respective
weights can be calculated via fdCoef():
b <- fdCoef(stencil = c(-(2^(3:0)), 2^(0:3)))
print(b)
#> $stencil
#> [1] -8 -4 -2 -1 1 2 4 8
#>
#> $weights
#> x-8h x-4h x-2h x-1h x+1h
#> 2.204586e-05 -3.703704e-03 1.185185e-01 -7.223986e-01 7.223986e-01
#> x+2h x+4h x+8h
#> -1.185185e-01 3.703704e-03 -2.204586e-05
#>
#> attr(,"remainder.coef")
#> [1] -0.01128748
#> attr(,"accuracy.order")
#> requested effective
#> NA 8
#> attr(,"expansion")
#> [1] "f' - 1.1287e-02 f^(9) + ..."
Here, the weights on the outermost points – the first step of the
iteration with \(h_1\) – are minuscule
(2.2e-5). The relative importance of each iteration for
this specific function at this specific point can be found from the
equation \[
f'_{\mathrm{Rich,8}} =
\frac{f(x-h_1)-168f(x-h_2)+5376f(x-h_3)-32768f(x-h_4)+32768f(x+h_4)-5376f(x+h_3)+168f(x+h_2)-f(x+h_1)}{45360\, h_4}
= \sum_{i=1}^4 w_{i}[f(x + h_i) - f(x - h_i)],
\] where \(\{w_i\}_{i=1}^4 \approx
\{0.608, -0.0997, +0.0031, -0.00002\}\). This result can be
calculated numerically:
fd <- sin(x0 + b$stencil / 8e4) * b$weights
abs(fd[1:4]) / sum(abs(fd[1:4]))
#> x-8h x-4h x-2h x-1h
#> 2.609937e-05 4.384834e-03 1.403170e-01 8.552721e-01
The absolute contribution of each term is proportional to specific terms of a certain polynomial. Practically, it implies that the weights of the summation terms further away from the point of interest decay exponentially (up to a certain constant), and similar accuracy can be achieved with fewer evaluations. In this example, 85% of the sum is defined by the finite difference with \(h_4\), and 14% by the one with \(h_3\). Therefore, the following call is twice as cheap, and the discrepancy between its result and the full Richardson-extrapolated value is negligible:
g2 <- Grad(f, x0, h = 1.25e-05, acc.order = 4, vectorised = TRUE, report = 0)
#> [1] 1
#> [1] 0.9999750 0.9999875
#> [1] 0.999975
#> [1] 0.9999875
#> [1] 1.000012
#> [1] 1.000025
print(g2)
#> [1] 0.5403023
c(diff = g1 - g2, Error8 = cos(x0) - g1, Error4 = cos(x0) - g2)
#> diff Error8 Error4
#> -2.377543e-12 4.636624e-12 2.259082e-12
In this example, the approximation errors from the Richardson extrapolation and a much cheaper weighted sum have the same order of magnitude.
Conclusion: choosing a large initial step and subsequently shrinking it is wasteful because similar, if not indistinguishable, accuracy can be achieved with half as many evaluations and a reasonably chosen bandwidth. If the function is to be evaluated many times, selecting the optimal step size via a data-driven procedure not only saves time whilst attaining comparable accuracy but also provides opportunities for speed-up via parallel evaluation of the function on the grid. Richardson extrapolation, on the other hand, is typically computed in a loop, and since a fully parallelised implementation of it is subsumed by the weighted-sum approach, the new package does not dedicate any special numerical routine to this particular case.
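As an illustration of such a data-driven procedure (a minimal sketch reusing the step.SW() selector shown earlier), the step size can be chosen once and then reused in subsequent cheap gradient calls:
h.opt <- step.SW(sin, 1)$par   # one-off automated step-size search, as in the earlier example
Grad(sin, 1, h = h.opt)        # fast 2nd-order central difference with the chosen step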
The term ‘extrapolation group’ can be found, e.g., in Lindfield and Penny (1989), where it is used in
sections on numerical integration and differentiation and describes
sub-intervals on which polynomials of different degrees are fitted to
improve approximation accuracy. Computer programmes in the 1980s were
often optimised for memory use and size, not necessarily for ease of
interpretation or transparency of their correspondence to the
theoretical relationships that they were translating into numbers.
Therefore, a reader might get confused by the terms ‘extrapolation
group’, ‘improvement iteration’, and ‘accuracy order’. The
pnd
package aims to provide more straightforward diagnostic
information.
Suppose that one wants to compute the derivative of \(f(x) = x^9\) at \(x_0 = 1\) – therefore, the true derivative
value is \(f'(1) = 9\). The
following code effectively debugs numDeriv::grad() and
shows how many values are computed (we choose the initial step size
\(h=2^{-10} = 1/1024\) to avoid any
representation error of \(x_0+h\)):
f <- function(x) {print(x, digits = 16); x^9}
fp1 <- numDeriv::grad(f, x = 1, method.args = list(r = 4, d = 2^-10, show.details = TRUE))
#> [1] 1
#> [1] 1.0009765625
#> [1] 0.9990234375
#> [1] 1.00048828125
#> [1] 0.99951171875
#> [1] 1.000244140625
#> [1] 0.999755859375
#> [1] 1.0001220703125
#> [1] 0.9998779296875
#>
#> first order approximations
#> [,1]
#> [1,] 9.00008010876
#> [2,] 9.00002002717
#> [3,] 9.00000500679
#> [4,] 9.00000125170
#>
#> Richarson improvement group No. 1
#> [,1]
#> [1,] 8.99999999997
#> [2,] 9.00000000000
#> [3,] 9.00000000000
#>
#> Richarson improvement group No. 2
#> [,1]
#> [1,] 9
#> [2,] 9
print(fp1, digits = 16)
#> [1] 9.000000000000064
In total, the function was called 9 times: 1 for the initial
dimensionality check and 2 times per each of the 4 iterations (since r = 4).
In this implementation, the terminology differs from the textbook that
it is referring to: numDeriv’s ‘first order approximations’
correspond to Lindfield & Penny’s ‘Group 1’, its ‘Richarson improvement
group No. 1’ to LP’s ‘Group 2’, and the final output returned by
grad() would be ‘Group 4’.
In central differences, \(f(x_0)\)
is not used; with 8 symmetric evaluations, the truncation error of the
result is \(O(h^8)\).
Therefore, the standard textbook results about the optimal step
size for central first differences being proportional to \(\epsilon_{\mathrm{mach}}^{1/3}\)
are not applicable to numDeriv::grad():
with the default accuracy order of 8, a step size of the
order \(\epsilon_{\mathrm{mach}}^{1/9}\), not \(\epsilon_{\mathrm{mach}}^{1/3}\), is
appropriate in this case.
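The difference in scale is easy to see (ignoring the problem-dependent constants in the optimal-step formulæ):
c(order2 = .Machine$double.eps^(1/3),   # ~6.1e-6, appropriate for accuracy order 2
  order8 = .Machine$double.eps^(1/9))   # ~0.018, appropriate for accuracy order 8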
With pnd, there is no need to re-define the function to
save its computed values because both the grid and the function values
are returned as attributes of the result.
# f <- function(x) x^9
# fp2 <- pnd::Grad(f, x = 1, h = "SW", acc.order = 8, vectorised1d = TRUE, report = 2)
# print(attributes(fp2)$step.search$iterations, digits = 16)
This highlights an important difference between the two packages.
numDeriv::grad() starts at a relatively large step size
\(h = 10^{-4} \cdot \max(|x_0|, 1)\)
and repeatedly shrinks it, yielding an exponentially decreasing sequence
\(h, h / v, h / v^2, \ldots\) with the
same reduction factor \(v\) to attain
the desired accuracy order (\(a = 2r = 8\) with the default number of
iterations \(r = 4\)) – slow due to many evaluations, but rather accurate.
pnd::Grad() starts at the rule-of-thumb step size of
the correct order \(\epsilon_{\text{mach}}^{1/(1+a)}\). By
default, it computes central derivatives \(f'_{\mathrm{CD}}(x, h)\) with accuracy
order \(a = 2\) – fast due to
fewer evaluations, with a smaller default step size that agrees with the
numerical-analysis literature, and with estimates of the approximation error
provided. If an accuracy order \(a > 2\) is requested, the evaluation grid
is linear: \(x \pm h, x \pm 2h, x \pm 3h, \ldots\)
Finally, the method argument show.details = TRUE is
ignored for Hessians in numDeriv:
g <- function(x) sum(sin(x))
hessian(g, 1:3, method.args = list(show.details = TRUE))
#> [,1] [,2] [,3]
#> [1,] -8.414710e-01 4.072455e-14 -2.864747e-13
#> [2,] 4.072455e-14 -9.092974e-01 1.358524e-14
#> [3,] -2.864747e-13 1.358524e-14 -1.411200e-01
This is due to the fact that the hessian.default()
method calls genD(), but the latter never checks if
method.args$show.details is TRUE in the loops
where the extrapolation is carried out. This silent operation mode was
probably implemented to avoid output verbosity since there are many
elements in matrices. Nevertheless, any user who has not looked at the
source code of hessian() would be puzzled by this
unexpected behaviour because the manual of ?hessian
explicitly refers to ?grad and says that
method.args is passed to grad.
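For comparison (our own check), the same flag should produce verbose output when passed to grad() directly:
numDeriv::grad(g, 1:3, method.args = list(show.details = TRUE))
# should print the ‘first order approximations’ and the ‘Richarson improvement group’
# tables before returning the gradient, approximately cos(1:3)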