How to register new layer datatypes

Purpose of this vignette

This vignette explains how to tweak “gmGeostasts” to declare new datatypes for data layers (e.g. for data with special characteristics besides compositional data), and associate creation and prediction function to it. If you are looking for a general introduction to the package, see this other vignette

library(compositions)
#> Welcome to compositions, a package for compositional data analysis.
#> Find an intro with "? compositions"
#> 
#> Attaching package: 'compositions'
#> The following object is masked from 'package:gmGeostats':
#> 
#>     logratioVariogram
#> The following objects are masked from 'package:stats':
#> 
#>     anova, cor, cov, dist, var
#> The following object is masked from 'package:graphics':
#> 
#>     segments
#> The following objects are masked from 'package:base':
#> 
#>     %*%, norm, scale, scale.default
library(gstat)
#> 
#> Attaching package: 'gstat'
#> The following object is masked from 'package:compositions':
#> 
#>     fit.lmc
#> The following object is masked from 'package:gmGeostats':
#> 
#>     variogram
library(gmGeostats)
library(magrittr)

Statistical scale and representation functions

The statistical scale of a data layer is a subjective assessment of the way in which pairs of values of that layer need to be compared. Classical statistical scales after Stevens (1946) are the nominal (two values are either equal or they are different), ordinal (two values are either equal, or one is larger than the other), interval (values can be meaningfully compared by the mathematical operation of subtraction) and ratio (values are strictly positive and can be meaningfully compared by the operation of quotient). Other scales have been introduced, such as several compositional scales for data about the amounts and proportions of components forming a system (Aitchison, 1986; van den Boogaart and Tolosana-Delgado, 2013; van den Boogaart, Tolosana-Delgado and Bren, 2021); for circular and spherical data; for distributional data; for positive definite matrices; etc.

A scale \(s\) is then typically described by a way of computing the difference between two values \(d_s(\cdot, \cdot)\), coupled with a description of the set \(\mathcal{E}_s\) of values of that layer that are at all possible. In the case of circular data, the set of possible values is \([-\pi, \pi)\), and given the periodicity condition, the way to compare two values \(a\) and \(b\) is \((a-b)\) modulo \(\pi\).

To ease the computation with observations \(x\) on such layers we want to define a transformation \(R(x)\) that delivers a representation \(z=R(x)\) of the data, such that: (i) \(R^{-1}(z)\) exists for all values of the linear span of \(R(x)\), (ii) it can be ensured that \(R^{-1}(z)\in \mathcal{E}_s\), and (iii) that \(d_s(x_1, x_2)\approx R(x_1)-R(x_2)\). A classical representation strategy of circular data is through an embedding into \(R^2\) the bivariate real space, by means of the transformation \[ R(\theta)=[\sin(\theta), \cos(\theta)]=\mathbf{z} \] with inverse operation \[ \theta = \tan^{-1} \frac{z_2}{z_1} \]

For our purposes, the absolute minimum you need to program is:

choose a class name for your data tyoe (e.g. “periodic”), and create a function with that name taking the values as they might be in your case study, converting them into a datamatrix and giving them the class of your choice

circular = function(x, varname ="theta", conversion=pi/180){
  # output to be a (N, 1)-datamatrix 
  if(length(dim(x))!=2){ # case `x` is a vector
    y = t(t(x))
    colnames(y) = varname
  }else if(nrow(x)!=1){ # case `x` is a too large matrix
    y = x[, varname]
  }
  y = y * conversion
  class(y) = "circular"
  return(y)
}

(The function can do other things, like in this case, allowing a potential conversion from degrees to radians, or managing several input cases).

create method of the function compositions::cdt() implementing the representation for data of your type and returning an rmult object

cdt.circular = function(x, ...){ 
  z = cbind(sin(x), cos(x))
  colnames(z) = c("z1", "z2")
  return(rmult(z, orig=x))
}

It is important that your cdt() method makes use of the two main arguments of the compositions::rmult() function: z (for the transformed scores) and orig (for the untransformed data).

create method of the function compositions::cdtInv() implementing the backrepresentation for data of your type (argument z expects the representation, and orig must be exactly what is set in this example; NOTICE the three dots at compositions:::gsi.orig)

cdtInv.circular = function(z, orig=compositions:::gsi.orig(x),...){ 
  #z = unclass(z)
  x = atan2(z[,2], z[,1])
  return(circular(x, varname=colnames(orig), conversion = 1))
}

Geostatistical analysis of scaled data, quick and dirty

With these few lines of programming you could already be able to use “gmGeostats” for your data. To show how, we will generate first a univariate, real-valued random field, take it as if it were circular data (in radians), extract some components out of it, and do a geostatistical analysis with the output.

## model setup
set.seed(333275)
xdt = data.frame(x=0, y=0, z=0) # one point is necesary
vg = vgm(model="Exp", psill=1, nugget=0, range=1, anis=c(30, 0.8)) # variogram model
gs = gstat(id="z", formula=z~1, locations = ~x+y , data=xdt, nmax=10, model=vg)
## sample point coordinates  
x <- runif(2000, min = 0, max = 10) # values between 0 and 10
Xdt = data.frame(x=x[1:1000], y=x[1001:2000])
# simulate random function
Z = predict(gs, newdata=Xdt, nsim=1)
#> drawing 1 GLS realisation of beta...
#> [using conditional Gaussian simulation]
# select columns
Zdt = Z[,3]
# define and plot data
Zdtc = circular(Zdt, varname = "theta", conversion = 1)
pairsmap(Zdtc, loc=Xdt)

Now we can proceed with the analysis. First we create the “gmSpatialModel” containing the transformed data

theta.gg = 
  make.gmMultivariateGaussianSpatialModel(
    data=cdt(Zdtc), coords = Xdt, # always use cdt in such cases!
    formula = ~1 # for ordinary (co)kriging
    )

compute and plot the variogram

theta.vg = gmGeostats::variogram(theta.gg)
plot(theta.vg)

and model it, in this case with a combination of short range exponental (of effective range approximately 1), a long-range spherical (range approx 3), and a nugget zero. All ways of modelling variograms are allowed, for instance with “gstat” variograms

theta.md = gstat::vgm(model="Exp", range=1/3, psill=0.1) %>% 
  {gstat::vgm(add.to=., model="Sph", range=3, psill=0.1)}
theta.gs = fit_lmc(v=theta.vg, g = theta.gg, model = theta.md)
plot(theta.vg, model=theta.gs$model)

Finally we extend the original data container with this model

theta.gg = 
  make.gmMultivariateGaussianSpatialModel(
    data=cdt(Zdtc), coords = Xdt, # always set V="clr" in such cases!
    formula = ~1, # for ordinary (co)kriging
    model = theta.gs$model
    )

The outcome can then be used for validation, prediction or simulation. Here we do cokriging on a \(0.1\)-step grid between \(0\) and \(10\)

x <- seq(from=0, to=10, by=0.1)
xx = expand.grid(x,x)
colnames(xx) = colnames(Xdt)
ng = KrigingNeighbourhood(nmax = 20, omax=7, maxdist=1)
theta.prds = predict(theta.gg, newdata = xx, pars=ng)
#> starting cokriging
#> Linear Model of Coregionalization found. Good.
#> [using ordinary cokriging]

and the result be reordered and backtransformed

theta.prds.grid = gsi.gstatCokriging2rmult(theta.prds)
theta.prds.back = backtransform(theta.prds.grid, as = cdt(Zdtc))
summary(theta.prds.back)
#>      theta        
#>  Min.   :-2.3318  
#>  1st Qu.:-0.2348  
#>  Median : 0.3163  
#>  Mean   : 0.3476  
#>  3rd Qu.: 0.9490  
#>  Max.   : 2.6576

Note that the function backtransform() is available in package “compositions” from version 1.0.1-9002, and it expects as the second argument as a transformed data set, typically the original one. To plot the result we might have to program a method for image_cokriged that should take care to fictionally reclass the backtransformed data to “spatialGridRmult” and choose a color sequence appropriate for the periodic nature of the data

image_cokriged.circular = function(x, ...){
  class(x) = c("spatialGridRmult", "rmult")
  image_cokriged(x, breaks=40, col=rainbow(10), ...)
}
image_cokriged(theta.prds.back, ivar="theta")

An excursion on superclasses

“gmGeostats” uses a mixture of S3 and S4 classes to manage the several kinds of objects, S3 classes mostly preferred for simple configuration objects, models and data elements, and S4 classes mostly in use for large compound spatial models and data containers. S4 classes, though being somewhat more complex to handle and slightly slower, have the advantage to allow for multiple dispatch, which this package extensively uses. S4 classes require its fields (called “slots”) to strictly belong to a specific class. To handle this condition, and at the same time allow for multiple methods of specification, estimation, fitting and prediction of spatial models and random functions, “gmGeostats” provides a series of abstract classes allowing to control that certain fields do contain only certain kinds of objects:

gmNeighbourhoodSpecification contains a description of the neighbourhood of a point during interpolation/simulation.

EmpiricalStructuralFunctionSpecification contains a desciption of empirical structural functions, typically empirical variograms in different formats (e.g. “gstatVariogram” from package “gstat”, or “logratioVariogram” from package “compositions”).

ModelStructuralFunctionSpecification, equivalent to the preceding one, this class contains specifications of models for structural functions (e.g. “variogramModel” or “CompLinModCoReg” for packages “gstat” resp. “compositions”).

gmValidationStrategy describes the way a model should be validated.

gmGaussianSimulationAlgorithm specifies the exact gaussian simulation algorithm to be used, and provides its parameters (e.g. number of bands for Turning Bands).

gmTrainingImage for multipoint statistics (MPS) methods, this abstract class gathers all ways to specify a gridded image.

gmUnconditionalSpatialModel convenience class of the union of “gmTrainingImage” and “gmGaussianModel” (a concrete class containing “ModelStructuralFunctionSpecification” with some extra information), it is thought to contain all future specifications of an unconditional random function, beyond the two members currently contained.

gmMPSParameters, analogous to “gmGaussianSimulationAlgorithm” or “gmValidationStrategy”, this abstract class contains all specifications of MPS algorithms available.

gmSpatialMethodParameters is a large container of both two-point and multipoint methods, i.e. descriptors of specific algorithms and their parameters. This union class should only contain other abstract claases!

The package “methods” provides a way of checking the subclasses and superclasses of any specific class, thanks to the function classesToAM():

classesToAM("gmSpatialMethodParameters", includeSubclasses = TRUE)
#>                              gSMP gmNS gMPS gmVS NULL gmKN gDSP LvOO NfCV .NUL
#> gmSpatialMethodParameters       0    0    0    0    0    0    0    0    0    0
#> gmNeighbourhoodSpecification    1    0    0    0    0    0    0    0    0    0
#> gmMPSParameters                 1    0    0    0    0    0    0    0    0    0
#> gmValidationStrategy            1    0    0    0    0    0    0    0    0    0
#> NULL                            0    1    1    1    0    0    0    0    0    0
#> gmKrigingNeighbourhood          0    1    0    0    0    0    0    0    0    0
#> gmDirectSamplingParameters      0    0    1    0    0    0    0    0    0    0
#> LeaveOneOut                     0    0    0    1    0    0    0    0    0    0
#> NfoldCrossValidation            0    0    0    1    0    0    0    0    0    0
#> .NULL                           0    0    0    0    1    0    0    0    0    0

The matrix contain a 1 if the row class is a subclass of the column class, and 0 otherwise.

Adapted empirical structural functions

Brief theoretical background

Directional data can be treated within the framework of complex random fields \[ Z(x) = U(x) + i V(x) \] The structural tool for complex data is the complex covariance function (Wackernagel, 2003; de Iaco et al, 2013) \[ C(h) = \underbrace{C_{UU}(h) + C_{VV}(h)}_{C^{Re}(h)} + i\underbrace{(C_{VU}(h)-C_{UV}(h))}_{C^{Im}(h)} \] where \(C_{UU}\) and \(C_{VV}\) are the direct covariances of resp. the real and imaginary part of the complex random field (in the case of the circular representation analogous to the direct covariances of \(sin(.)\) and \(cos(.)\)), and \(C_{UV}\) and \(C_{VU}\) are the cross-covariance and real and imaginary part (analogous to the cross-covariance of \(sin(.)\) and \(cos(.)\)).

A first approach

Let us thus define a function that computes \(C^{Re}(h)\) and \(C^{Im}(h)\) from \(C_{UU}(h), C_{VV}(h)\) and \(C_{UV}(h)\):

# compute covariogram!
theta.cvg = gmGeostats::variogram(theta.gg, covariogram=TRUE)
class(theta.cvg)
#> [1] "gstatVariogram" "data.frame"
# structure of the gstatVariogram object
head(theta.cvg)
#>      np      dist       gamma dir.hor dir.ver    id
#> 1  3052 0.2085728 -0.07013820       0       0 z1.z2
#> 2  8836 0.4841303 -0.06561390       0       0 z1.z2
#> 3 13802 0.7961717 -0.04900795       0       0 z1.z2
#> 4 18656 1.1083466 -0.03414289       0       0 z1.z2
#> 5 23450 1.4191027 -0.02448557       0       0 z1.z2
#> 6 26866 1.7309977 -0.01574590       0       0 z1.z2
# values controlling the split in direct and cross-variograms
theta.vg$id
#>  [1] z1.z2 z1.z2 z1.z2 z1.z2 z1.z2 z1.z2 z1.z2 z1.z2 z1.z2 z1.z2 z1.z2 z1.z2
#> [13] z1.z2 z1.z2 z1.z2 z2    z2    z2    z2    z2    z2    z2    z2    z2   
#> [25] z2    z2    z2    z2    z2    z2    z1    z1    z1    z1    z1    z1   
#> [37] z1    z1    z1    z1    z1    z1    z1    z1    z1   
#> Levels: z1.z2 z2 z1

# function doing the recalculations
recompute_complex_cov = function(cv){
  # split the gstatVariogram structure in the individual vgrams
  aux = split(cv[, -6], cv$id)
  # ad-hoc function taking two vgrams and operating their gamma column
  sumGamma = function(x,y, alpha=1){
    x$gamma = x$gamma + alpha*y$gamma
    return(x)
  }
  # compute C^{Im} and C^{Re}
  aaxx = list(Im=sumGamma(aux$z1.z2, aux$z1.z2, -1), 
              Re=sumGamma(aux$z1, aux$z2)
              )
  # undo the split
  f = rep(c("Im", "Re"), each=nrow(aux$z1))
  res = unsplit(aaxx, f)
  res$id = as.factor(f)
  # restore the class and return 
  class(res) = class(cv)
  return(res)
}
# do the calculations!
theta.vg %>% recompute_complex_cov %>% plot

This is again a quick and dirty solution, as:

by using the plotting capacities of “gstat” for an incomplete object we waste half the space of the plot;
actually the current function can only deal with symmetric covariances (because of the way that gstat::variogram estimates the covariance), and
hence the imaginary part is identically zero.

Non-symmetric covariance

We need to improve the computations slightly. First we need to consider estimates for the whole 360\(^o\), so that we can obtain \(C_{VU}(h)=C_{UV}(-h)\),

# compute variogram for the whole circle, i.e. until 360 deg
theta.cvg = gmGeostats::variogram(
  theta.gg, covariogram=TRUE, alpha=(0:11)*30)
# how are the directions structured?
theta.cvg[theta.cvg$id=="z1", "dir.hor"]
#>   [1]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  30  30
#>  [19]  30  30  30  30  30  30  30  30  30  30  30  30  30  30  60  60  60  60
#>  [37]  60  60  60  60  60  60  60  60  60  60  60  60  90  90  90  90  90  90
#>  [55]  90  90  90  90  90  90  90  90  90  90 120 120 120 120 120 120 120 120
#>  [73] 120 120 120 120 120 120 120 120 150 150 150 150 150 150 150 150 150 150
#>  [91] 150 150 150 150 150 150 180 180 180 180 180 180 180 180 180 180 180 180
#> [109] 180 180 180 180 210 210 210 210 210 210 210 210 210 210 210 210 210 210
#> [127] 210 210 240 240 240 240 240 240 240 240 240 240 240 240 240 240 240 240
#> [145] 270 270 270 270 270 270 270 270 270 270 270 270 270 270 270 270 300 300
#> [163] 300 300 300 300 300 300 300 300 300 300 300 300 300 300 330 330 330 330
#> [181] 330 330 330 330 330 330 330 330 330 330 330 330

and second we must modify our recomputing function to take into account this symmetry and the particular way the directions are stored in the object:

# function doing the recalculations
recompute_complex_cov_anis = function(cv){
  # split the gstatVariogram structure in the individual vgrams
  aux = split(cv[, -6], cv$id)
  # ad-hoc function taking two vgrams and operating their gamma column
  sumGamma = function(x,y, alpha=1){
    y$gamma = x$gamma + alpha*y$gamma
    return(y) # this time return the second argument!
  }
  N = nrow(aux[[1]])/2 
  # compute C^{Im} and C^{Re}
  aaxx = list(Im=sumGamma(aux$z1.z2[-(1:N),], aux$z1.z2[(1:N),], -1), 
              Re=sumGamma(aux$z1[1:N,], aux$z2[1:N,])
              )
  # undo the split
  f = rep(c("Im", "Re"), each=N)
  res = unsplit(aaxx, f)
  res$id = as.factor(f)
  # restore the class and return 
  class(res) = class(cv)
  return(res)
}

theta.cvg %>% recompute_complex_cov_anis() %>% plot

Although the outcome is quite satisfactory, the code is currently tricky to use, as it requires for instance that one computes covariograms for a whole set of directions over the 360\(^o\), which is not common.

A tailored function

As a consequence, we should produce a tailored function that takes responsibility over all these details

circularCovariogram = function(g, dirs=4){
  # case dirs is only the number of desired directions
  if(length(dirs)==1){
    delta = 180/dirs
    dirs = delta*(0:(dirs-1))
  }
  # extend the nr of directions to the whole circle
  dirs = c(dirs, 180 + dirs)
  # compute the covariogram
  cvg = gmGeostats::variogram(g, covariogram=TRUE, alpha=dirs)
  # recast and set the new class label
  rcvg = recompute_complex_cov_anis(cvg)
  class(rcvg) = c("circularCovariogram", class(cvg))
  return(rcvg)
}

The goal of extending the class marker with a specific class for this type of structural function is to be able to produce later on specific fitting behaviors adapted to this kind of data, like the one presented in De Iaco et al (2013). This will be discussed in the following sections. In the meantime, what we need is a way to recast this class back to “gstatVariogram”, to keep compatibility. For this, we just need to create a method for the function “as.gstatVariogram”, as well as register our new class as a subclass of the abstract class “EmpiricalStructuralFunctionSpecification”:

# which arguments has as.gstatVariogram?
args(as.gstatVariogram)
#> function (vgemp, ...) 
#> NULL
# new method:
as.gstatVariogram.circularCovariogram = function(vgemp, ...){
  class(vgemp) = class(vgemp)[-1] # drop the extra class marker
  return(vgemp) # return result
}
# export the ad-hoc S3 class to S4
setOldClass("circularCovariogram")
# declare the new class an "EmpiricalStructuralFunctionSpecification"
setIs("circularCovariogram", "EmpiricalStructuralFunctionSpecification")

# check that all went right:
theta.cvg = circularCovariogram(theta.gg, dirs = 6)
is(theta.cvg, "circularCovariogram")
#> [1] TRUE
is(theta.cvg, "gstatVariogram")
#> [1] TRUE
is(theta.cvg, "data.frame")
#> [1] TRUE
is(theta.cvg, "EmpiricalStructuralFunctionSpecification")
#> [1] TRUE

Future work

In future extensions of this vignette we will discuss the way to fit covariance models to circular covariograms, setup estimation models/methods adapted to the nature of the data, and show other examples of how to register them to the package (usage of setIs() and coercion in conjunction with the abstract classes mentioned, validate()- and predict()-methods, creation of own make.gm****Model() data containers, etc). We will continue with our illustrative example of circular data, using developments by Wackernagel (2003) and de Iaco et al (2013).

References

Aitchison, J. (1986) The Statistical Analysis of Compositional Data Chapman & Hall Ltd., London (UK). (Reprinted in 2003 with additional material by The Blackburn Press)

Boogaart, K. G. v. d.; Tolosana-Delgado, R. (2013) Analysing compositional data with R, Springer, Heidelberg

Boogaart, K.G. v.d.; Tolosana-Delgado, Raimon; Bren, Matevz (2021). compositions: Compositional Data Analysis. R package version 2.0-2. http://www.stat.boogaart.de/compositions/

De Iaco, S.; Posa, D.; Palma, M. (2013) Complex-Valued Random Fields for Vectorial Data: Estimating and Modeling Aspects. Mathematical Geosciences 45: 557???573

Stevens, S. (1946) On the theory of scales of measurement. Science 103: 677-680

Wackernagel, H. (2003) Multivariate geostatistics???an introduction with applications, 3rd edn. Springer, Berlin