Compatibility of pnd with the syntax of numDeriv

Andreï V. Kostyrka, University of Luxembourg

Created: 2024-10-01, last modified: 2024-10-01, compiled: 2025-02-25

Abstract

We describe the syntax of the widely used numDeriv package and show how the functions from the pnd package recognise and handle the parameters related to numerical difference computation. We draw parallels between the two packages, examine the differences, and provide recommendations for new users.

Load the packages first:

library(pnd)
#> Parallel numerical derivatives v. 0.0.6 (2025-01-23).
#> 4 physical cores for parallelism through mclapply forking are available on Linux.
library(numDeriv)

Numerical derivatives are ubiquitous in applied statistics and numerical methods, which is why many packages rely on streamlined implementations of numerical derivatives as dependencies. The popular numDeriv package by Paul Gilbert and Ravi Varadhan has 314 reverse dependencies as of October 2024. Its flexibility has allowed other packages to achieve better results by providing reasonable defaults and an interface for the fine-tuning that matters in numerical optimisation and statistical inference. However, it has no parallel capabilities, which is why calculating numerical gradients with its grad() function can take a long time.

This vignette showcases the similarities and differences between the features of numDeriv and pnd for a smooth and painless transition to the new package.

2 Breakdown of numDeriv::grad

2.1 Handling vectorised inputs

The implementation of numDeriv::grad() is clever: it accepts both vectorised and non-vectorised functions as the input argument. Since gradients are defined for scalar functions, it has to check whether the function of interest is scalar- or vector-valued. This can be non-trivial with vectorised functions: although \(\sin x\) is a scalar function, sin(1:100) returns a vector in R. Therefore, numDeriv::grad() has to determine whether the implementation of \(f\) is \(\mathbb{R}^n \mapsto \mathbb{R}^n\) (a scalar or vectorised scalar function), \(\mathbb{R}^n \mapsto \mathbb{R}\) (a scalar multivariate function), or neither. Hence, functions like \(\mathbb{R} \mapsto \mathbb{R}^n\) or \(\mathbb{R}^n \mapsto \mathbb{R}^m\) are not considered valid inputs.

Internally, numDeriv::grad(f, x0, ...) computes f0 = f(x0, ...) and checks the length of f0. For an input argument x0 with length(x0) = n, the allowed lengths of f0 are either 1 (scalar output) or n (vectorised output). Expressions such as grad(function(x) c(sin(x), cos(x)), x = 1) return an error, implying that for vector-valued functions, the user should compute a Jacobian instead. Similarly, numDeriv::hessian() checks the dimensions of f0 = f(x0, ...) and only allows non-vectorised functions with length(f0) == 1. Finally, numDeriv::genD() computes first- and second-derivative information (without cross-derivatives).
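The following minimal sketch mimics this dimensionality check (the helper classifyFun() is made up for illustration; numDeriv's actual implementation is more involved):

classifyFun <- function(f, x0, ...) {
  f0 <- f(x0, ...)
  if (length(f0) == 1) return("scalar-valued")
  if (length(f0) == length(x0)) return("vectorised scalar")
  stop("vector-valued function: use jacobian() instead of grad()")
}
classifyFun(function(x) sum(sin(x)), 1:4)          # "scalar-valued"
classifyFun(sin, 1:4)                              # "vectorised scalar"
# classifyFun(function(x) c(sin(x), cos(x)), 1:4)  # would throw an error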

This check is not without its drawbacks: it may miscalculate function dimensions. The implementation in pnd::Grad() handles the edge cases that cause wrong outputs in numDeriv::grad().

The pnd package provides similar functions: Grad() is a drop-in replacement for numDeriv::grad(), Hessian() replaces numDeriv::hessian(), Jacobian() subsumes numDeriv::jacobian(), and GenD(), whilst not corresponding to numDeriv::genD(), is the real workhorse that computes arbitrary derivatives of arbitrary accuracy order.

Example 1: correct dimensionality check results for vector-valued functions.

f2 <- function(x) c(sin(x), cos(x))  # Vector output -> gradient is unsupported
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
hessian(f2, x = 1:4)
#> Error in hessian.default(f2, x = 1:4): Richardson method for hessian assumes a scalar valued function.

Grad(f2, x = 1:4)
#> Error in Grad(f2, x = 1:4): Use 'Jacobian()' instead of 'Grad()' for vector-valued functions to obtain a matrix of derivatives.

This check correctly identifies non-vectorised functions as well:

f2 <- function(x) c(sum(sin(x)), sum(cos(x)))
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
hessian(f2, x = 1:4)
#> Error in hessian.default(f2, x = 1:4): Richardson method for hessian assumes a scalar valued function.
jacobian(f2, x = 1:4)
#>            [,1]       [,2]       [,3]       [,4]
#> [1,]  0.5403023 -0.4161468 -0.9899925 -0.6536436
#> [2,] -0.8414710 -0.9092974 -0.1411200  0.7568025

Grad(f2, x = 1:4)
#> Error in Grad(f2, x = 1:4): Use 'Jacobian()' instead of 'Grad()' for vector-valued functions to obtain a matrix of derivatives.
Jacobian(f2, x = 1:4)
#>            [,1]       [,2]       [,3]       [,4]
#> [1,]  0.5403023 -0.4161468 -0.9899925 -0.6536436
#> [2,] -0.8414710 -0.9092974 -0.1411200  0.7568025
#> attr(,"step.size")
#> [1] 6.055454e-06 1.211091e-05 1.816636e-05 2.422182e-05
#> attr(,"step.size.method")
#> [1] "default"

Example 2: valid input, invalid output in numDeriv.

If a function consists of individually vectorised components and returns an output whose length differs from that of the input, numDeriv::jacobian() may return wrong results. Specifically, the output should be the stacked coordinate-wise gradients, but it consists of stacked diagonal matrices instead.

f2 <- function(x) c(sin(x), cos(x))
grad(f2, x = 1:4)
#> Error in grad.default(f2, x = 1:4): grad assumes a scalar valued function.
jacobian(f2, x = 1:4)
#>            [,1]       [,2]       [,3]       [,4]
#> [1,]  0.5403023  0.0000000  0.0000000  0.0000000
#> [2,]  0.0000000 -0.4161468  0.0000000  0.0000000
#> [3,]  0.0000000  0.0000000 -0.9899925  0.0000000
#> [4,]  0.0000000  0.0000000  0.0000000 -0.6536436
#> [5,] -0.8414710  0.0000000  0.0000000  0.0000000
#> [6,]  0.0000000 -0.9092974  0.0000000  0.0000000
#> [7,]  0.0000000  0.0000000 -0.1411200  0.0000000
#> [8,]  0.0000000  0.0000000  0.0000000  0.7568025

2.2 Approximation method

There are 3 methods implemented in numDeriv for gradient calculation: ‘simple’, ‘Richardson’, and ‘complex’. For simplicity, the formulæ given here assume scalar arguments; for vector arguments, these calculations are done coordinate by coordinate.
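For reference, all three methods share the same interface; the ‘complex’ method requires a function that accepts complex arguments and returns complex values, which sin() does:

numDeriv::grad(sin, 1, method = "simple")      # one-sided difference with an absolute step
numDeriv::grad(sin, 1, method = "Richardson")  # iterated central differences (the default)
numDeriv::grad(sin, 1, method = "complex")     # complex-step derivative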

The syntax of numDeriv::grad makes consistent step size selection a chore because the methods implement it differently. If method = "simple", the step size is method.args$eps (defaults to \(10^{-4}\)), and it is absolute. If method = "Richardson", the step size is the initial step size of the Richardson extrapolation, computed coordinate-wise from method.args$d (defaults to \(10^{-4}\)), method.args$eps, and method.args$zero.tol as d*x + (abs(x) < zero.tol) * eps, i.e. relative to \(x\) unless \(|x|\) falls below the zero tolerance, in which case the absolute offset eps is added.
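A short illustration of the difference between the two rules, assuming numDeriv's documented defaults (d = 1e-4, eps = 1e-4, zero.tol = sqrt(.Machine$double.eps / 7e-7)):

d <- 1e-4; eps <- 1e-4; zero.tol <- sqrt(.Machine$double.eps / 7e-7)
x0 <- c(0, 1e-6, 0.1, 1000)
h.simple     <- rep(eps, length(x0))                  # absolute step, identical for every coordinate
h.richardson <- d * x0 + (abs(x0) < zero.tol) * eps   # relative step unless |x| is below the tolerance
rbind(x0, h.simple, h.richardson)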

Technical remarks.

3 Compatibility implies syntax support, not identical values
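The same call pattern works in both packages, but the numerical results typically agree only up to the approximation error because the default step sizes and extrapolation schemes differ. A minimal check (output omitted; the discrepancy is tiny but usually non-zero):

f <- function(x) sum(exp(x))
numDeriv::grad(f, x = 1:2) - pnd::Grad(f, x = 1:2)  # differences of the order of the truncation and rounding errors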

3.1 Zero tolerances may cause a discontinuity

Why using d*x + (abs(x)<zero.tol) * eps is confusing
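The expression is discontinuous in x: just below the zero tolerance, the absolute offset eps kicks in, and the step jumps by several orders of magnitude. A minimal sketch, again assuming the documented default zero.tol:

d <- 1e-4; eps <- 1e-4
zero.tol <- sqrt(.Machine$double.eps / 7e-7)  # approx. 1.8e-5
x <- zero.tol * c(0.999, 1.001)               # two nearby points straddling the tolerance
d * x + (abs(x) < zero.tol) * eps             # approx. 1e-04 versus 1.8e-09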

4 Richardson extrapolation

To dissect the Richardson extrapolation practically, it suffices to make the function print its input arguments.

f <- function(x) {print(x); sin(x)}
x0 <- 1
g1 <- numDeriv::grad(f, x0)
#> [1] 1
#> [1] 1.0001
#> [1] 0.9999
#> [1] 1.00005
#> [1] 0.99995
#> [1] 1.000025
#> [1] 0.999975
#> [1] 1.000012
#> [1] 0.9999875
print(g1)
#> [1] 0.5403023

cat("Auto-detected step:", step.SW(sin, x0)$par, "\n")
#> Auto-detected step: 5e-06
hgrid <- 10^seq(-10, -4, 1/32)
errors <- sapply(hgrid, function(h) Grad(sin, x0, h = h)) - cos(x0)
plot(hgrid, abs(errors), log = "xy", cex = 0.6)

Even with 4 Richardson extrapolations based on 8 evaluations, the default final step size is 1.25e-5. This is larger than the optimal step size visible in the plot (less than 1e-5), larger than the empirically determined step size 5e-6, and larger even than the rule-of-thumb value \(\epsilon_{\mathrm{mach}}^{1/3} \approx 6\cdot 10^{-6}\).

The equivalence between the numDeriv and pnd implementations of the gradient can be established by computing the optimal weights in the finite difference for the points in the iterations of the Richardson extrapolation. In this example, the default sequence of steps is \(\{h_i\}_{i=1}^4 = 10^{-4} / \{1, 2, 4, 8\}\), and the resulting stencil is \(\{\pm h_i\}_{i=1}^4\). The respective weights can be calculated via fdCoef():

b <- fdCoef(stencil = c(-(2^(3:0)), 2^(0:3)))
print(b)
#> $stencil
#> [1] -8 -4 -2 -1  1  2  4  8
#> 
#> $weights
#>          x-8h          x-4h          x-2h          x-1h          x+1h 
#>  2.204586e-05 -3.703704e-03  1.185185e-01 -7.223986e-01  7.223986e-01 
#>          x+2h          x+4h          x+8h 
#> -1.185185e-01  3.703704e-03 -2.204586e-05 
#> 
#> attr(,"remainder.coef")
#> [1] -0.01128748
#> attr(,"accuracy.order")
#> requested effective 
#>        NA         8 
#> attr(,"expansion")
#> [1] "f' - 1.1287e-02 f^(9) + ..."

Here, the weights on the outermost points – corresponding to the first iteration with \(h_1\) – are minuscule (2.2e-5). The extrapolated derivative is the weighted sum \[ f'_{\mathrm{Rich},8} = \frac{f(x-h_1)-168f(x-h_2)+5376f(x-h_3)-32768f(x-h_4)+32768f(x+h_4)-5376f(x+h_3)+168f(x+h_2)-f(x+h_1)}{45360\,h_4} = \frac{1}{h_4}\sum_{i=1}^4 w_{i}[f(x + h_i) - f(x - h_i)], \] where \(\{w_i\}_{i=1}^4 \approx \{-0.00002, +0.0037, -0.1185, +0.7224\}\). The relative importance of each iteration for this specific function at this specific point can be calculated numerically:

fd <- sin(x0 + b$stencil / 8e4) * b$weights
abs(fd[1:4]) / sum(abs(fd[1:4]))
#>         x-8h         x-4h         x-2h         x-1h 
#> 2.609937e-05 4.384834e-03 1.403170e-01 8.552721e-01

The absolute contribution of each term decreases sharply with the distance from the point of interest: the weights of the summation terms further away decay roughly exponentially (up to a constant), so similar accuracy can be achieved with fewer evaluations. In this example, 85% of the sum is determined by the finite difference with \(h_4\) and 14% by the one with \(h_3\). Therefore, the following call is twice as cheap, and the discrepancy between its result and the full Richardson-extrapolated value is negligible:

g2 <- Grad(f, x0, h = 1.25e-05, acc.order = 4, vectorised = TRUE, report = 0)
#> [1] 1
#> [1] 0.9999750 0.9999875
#> [1] 0.999975
#> [1] 0.9999875
#> [1] 1.000012
#> [1] 1.000025
print(g2)
#> [1] 0.5403023

c(diff = g1 - g2, Error8 = cos(x0) - g1, Error4 = cos(x0) - g2)
#>          diff        Error8        Error4 
#> -2.377543e-12  4.636624e-12  2.259082e-12

In this example, the approximation errors from the Richardson extrapolation and a much cheaper weighted sum have the same order of magnitude.

Conclusion: choosing a large initial step and subsequently shrinking it is wasteful because similar, if not indistinguishable, accuracy can be achieved with half as many evaluations and a reasonably chosen step size. If the function is to be evaluated many times, selecting the optimal step size via a data-driven procedure not only saves time whilst attaining comparable accuracy but also opens up speed-ups via parallel evaluation of the function on the grid. Richardson extrapolation, on the other hand, is typically computed in a loop, and since a fully parallelised implementation of it is subsumed by the weighted-sum approach, the new package does not dedicate a special numerical routine to this particular case.

5 Diagnostics

5.1 Higher-order accuracy diagnostics

The term ‘extrapolation group’ can be found, e.g., in Lindfield and Penny (1989), where it is used in sections on numerical integration and differentiation and describes sub-intervals on which polynomials of different degrees are fitted to improve approximation accuracy. Computer programmes in the 1980s were often optimised for memory use and size, not necessarily for ease of interpretation or transparency of their correspondence to the theoretical relationships that they were translating into numbers. Therefore, a reader might get confused by the terms ‘extrapolation group’, ‘improvement iteration’, and ‘accuracy order’. The pnd package aims to provide more straightforward diagnostic information.

Suppose that one wants to compute the derivative of \(f(x) = x^9\) at \(x_0 = 1\) – therefore, the true derivative value is \(f'(1) = 9\). The following code effectively debugs numDeriv::grad() and shows how many values are computed (we choose the initial step size \(h=2^{-10} = 1/1024\) to avoid any representation error of \(x_0+h\)):

f <- function(x) {print(x, digits = 16); x^9} 
fp1 <- numDeriv::grad(f, x = 1, method.args = list(r = 4, d = 2^-10, show.details = TRUE))
#> [1] 1
#> [1] 1.0009765625
#> [1] 0.9990234375
#> [1] 1.00048828125
#> [1] 0.99951171875
#> [1] 1.000244140625
#> [1] 0.999755859375
#> [1] 1.0001220703125
#> [1] 0.9998779296875
#> 
#>  first order approximations 
#>               [,1]
#> [1,] 9.00008010876
#> [2,] 9.00002002717
#> [3,] 9.00000500679
#> [4,] 9.00000125170
#> 
#>  Richarson improvement group No.  1 
#>               [,1]
#> [1,] 8.99999999997
#> [2,] 9.00000000000
#> [3,] 9.00000000000
#> 
#>  Richarson improvement group No.  2 
#>      [,1]
#> [1,]    9
#> [2,]    9
print(fp1, digits = 16)
#> [1] 9.000000000000064

In total, the function was called 9 times: once for the initial dimensionality check and twice per each of the 4 iterations (since r = 4 and each central difference requires two evaluations). In this implementation, the terminology differs from the textbook it refers to: numDeriv's ‘first order approximations’ correspond to Lindfield and Penny's ‘Group 1’, its ‘Richarson improvement group No. 1’ to their ‘Group 2’, and the final output returned by grad() would be ‘Group 4’.

In central differences, \(f(x_0)\) is not used; with 8 symmetric evaluations, the truncation error of the result is \(O(h^8)\). Therefore, the standard textbook result that the optimal step size for central first differences is proportional to \(\epsilon_{\mathrm{mach}}^{1/3}\) does not apply to numDeriv::grad(): its default accuracy order is 8, for which a much larger step size of the order \(\epsilon_{\mathrm{mach}}^{1/9}\) is appropriate.
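This scaling follows from balancing the \(O(h^a)\) truncation error against the \(O(\epsilon_{\mathrm{mach}}/h)\) rounding error, which yields \(h^* \propto \epsilon_{\mathrm{mach}}^{1/(a+1)}\) up to a constant that depends on the higher derivatives of \(f\). A quick computation of the two rules of thumb:

# Rule-of-thumb steps h ~ eps_mach^(1/(a+1)) for accuracy orders a = 2 and a = 8
eps.mach <- .Machine$double.eps
c(order2 = eps.mach^(1/3), order8 = eps.mach^(1/9))  # roughly 6.1e-06 and 1.8e-02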

With pnd, there is no need to re-define the function to save its computed values because both the grid and the function values are returned as attributes of the result.

# f <- function(x) x^9
# fp2 <- pnd::Grad(f, x = 1, h = "SW", acc.order = 8, vectorised1d = TRUE, report = 2)
# print(attributes(fp2)$step.search$iterations, digits = 16)

This highlights an important difference between the two packages: pnd returns its diagnostic information (the evaluation grid, the function values, and the step-selection details) as attributes of the result, whereas numDeriv only prints intermediate values to the console when method.args$show.details = TRUE.

Finally, the method.args element show.details = TRUE is ignored for Hessians in numDeriv:

g <- function(x) sum(sin(x))
hessian(g, 1:3, method.args = list(show.details = TRUE))
#>               [,1]          [,2]          [,3]
#> [1,] -8.414710e-01  4.072455e-14 -2.864747e-13
#> [2,]  4.072455e-14 -9.092974e-01  1.358524e-14
#> [3,] -2.864747e-13  1.358524e-14 -1.411200e-01

This is because hessian.default() calls genD(), but the latter never checks whether method.args$show.details is TRUE in the loops where the extrapolation is carried out. This silent mode of operation was probably chosen to avoid verbose output, since the matrices involved have many elements. Nevertheless, a user who has not looked at the source code of hessian() may be puzzled by this behaviour because the manual page ?hessian explicitly refers to ?grad and states that method.args is passed to grad.

6 References

Fornberg, Bengt. 1988. “Generation of Finite Difference Formulas on Arbitrarily Spaced Grids.” Mathematics of Computation 51 (184): 699–706. https://doi.org/10.1090/S0025-5718-1988-0935077-0.
Kostyrka, Andreï V. 2025. “What Are You Doing, Step Size: Fast Computation of Accurate Numerical Derivatives with Finite Precision.” Working paper.
Lindfield, G. R., and J. E. T. Penny. 1989. Microcomputers in Numerical Analysis. Halsted Press.