Neural-network imputation with deepImp

This vignette is pre-computed: its outputs were generated on a machine with a working torch backend (see vignettes/deepImp.Rmd.orig and vignettes/precompute.R). This keeps the package check fast and free of a Python or libtorch dependency while still showing real results.

Why neural-network imputation?

Most real data sets contain missing values, and discarding incomplete rows wastes information and can bias an analysis. Imputation fills the gaps with plausible values so the data can be analysed as a whole. Classical imputation methods (mean imputation, \(k\)-nearest neighbours, regression/PMM) assume a fixed, usually linear, relationship between a variable and its predictors.

deepImp instead learns that relationship with a neural network, which can capture non-linear and interaction structure without the user specifying it. It imputes one variable at a time, using all the others as predictors, and cycles through the variables repeatedly until the imputations stabilise (an iterative, chained scheme). The package handles two settings with the same engine:

Networks run on a native torch backend by default (no Python required), or optionally on keras3.

library(deepImp)

The architecture: deepimp_arch()

The network is described by a deepimp_arch() object. Rather than hard-coding a topology, you choose the depth and width (hidden), the regularisation (dropout, batchnorm), the activation, and the optimizer and its learning_rate:

arch <- deepimp_arch(hidden = c(128, 64, 32), dropout = 0.1, activation = "relu")
arch
#> <deepimp_arch>
#>   hidden layers : 128 -> 64 -> 32 
#>   dropout       : 0.1 
#>   activation    : relu 
#>   batchnorm     : TRUE 
#>   optimizer     : adam (lr = 0.001 )

hidden = c(128, 64, 32) means three hidden layers of 128, 64 and 32 units. For small data sets or quick runs, deepimp_arch_small() is a lighter preset:

deepimp_arch_small()
#> <deepimp_arch>
#>   hidden layers : 64 -> 32 
#>   dropout       : 0.1 
#>   activation    : relu 
#>   batchnorm     : TRUE 
#>   optimizer     : adam (lr = 0.001 )

The same architecture object is translated to whichever backend you select, so a configuration is portable across torch and keras3.

Imputing mixed-type data with impNNet()

We use the sleep data from the VIM package – mammal sleep measurements that already contain genuine missing values.

data(sleep, package = "VIM")
str(sleep)
#> 'data.frame':    62 obs. of  10 variables:
#>  $ BodyWgt : num  6654 1 3.38 0.92 2547 ...
#>  $ BrainWgt: num  5712 6.6 44.5 5.7 4603 ...
#>  $ NonD    : num  NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
#>  $ Dream   : num  NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
#>  $ Sleep   : num  3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
#>  $ Span    : num  38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
#>  $ Gest    : num  645 42 60 25 624 180 35 392 63 230 ...
#>  $ Pred    : int  3 3 1 5 3 4 1 4 1 1 ...
#>  $ Exp     : int  5 1 1 2 5 4 1 5 2 1 ...
#>  $ Danger  : int  3 3 1 3 4 4 1 4 1 1 ...
colSums(is.na(sleep))
#>  BodyWgt BrainWgt     NonD    Dream    Sleep     Span     Gest     Pred 
#>        0        0       14       12        4        4        4        0 
#>      Exp   Danger 
#>        0        0

impNNet() first infers the measurement scale of each column with guessType() (you can override this with the vartypes argument). This determines whether a column is modelled by regression (numeric/count) or classification (binary/nominal):

guessType(sleep)$type
#>  [1] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [8] "count"   "count"   "count"

Now we impute. We use the small architecture and a modest number of epochs to keep the example quick; in practice you would train longer. The seed makes the run reproducible (exactly so when dropout = 0):

imp <- impNNet(sleep, arch = deepimp_arch_small(), epochs = 50,
               iterations = 3, seed = 1)
imp
#> <deepimp>
#>   observations : 62
#>   variables    : 10
#>   imputations  : 1
#>   backend      : torch
#>   architecture :
#> <deepimp_arch>
#>   hidden layers : 64 -> 32 
#>   dropout       : 0.1 
#>   activation    : relu 
#>   batchnorm     : TRUE 
#>   optimizer     : adam (lr = 0.001 )

The result is a "deepimp" object. summary() reports how many cells were imputed per variable, and getImputed() (or as.data.frame()) returns the completed data frame:

summary(imp)
#> <deepimp> summary
#>   observations : 62
#>   variables    : 10
#>   imputations  : 1
#>   missing per variable:
#>  BodyWgt BrainWgt     NonD    Dream    Sleep     Span     Gest     Pred 
#>        0        0       14       12        4        4        4        0 
#>      Exp   Danger 
#>        0        0 
#>   converged in : 3 iteration(s); last change = 0.4711
completed <- getImputed(imp)

To see the imputation at work, look at cells that were missing before. For example, the non-dreaming sleep variable NonD had missing entries; here are the values deepImp filled in:

missing_rows <- which(is.na(sleep$NonD))
round(completed$NonD[missing_rows], 2)
#>  [1]  4.21 10.04 12.37  4.19  5.01  7.77 10.43  9.16  8.03  6.29 10.55  2.90
#> [13]  9.37 12.10

plot() shows the convergence of the chained imputation – the mean change in the imputed cells from one iteration to the next, which should decrease:

plot(imp)

plot of chunk impnnet-plot

Compositional data and rounded zeros with impNNetCoDa()

Compositional data carry only relative information: what matters is the ratios between parts (for example, the proportions of chemical elements in a soil sample), not their absolute magnitudes. A rounded zero is a part recorded as zero only because its true value fell below a measurement detection limit – it is not a true absence. Ordinary imputation is inappropriate here for two reasons: it ignores the relative (log-ratio) geometry of the data, and it may return values above the detection limit, which contradicts the very reason the value was recorded as zero.

impNNetCoDa() addresses both: it works in pivot log-ratio coordinates and censors each imputed value at the detection limit. We illustrate with a small three-part composition in which some values of part a fall below a limit of 1 and are therefore recorded as zero:

set.seed(1)
n <- 60
x <- data.frame(
  a = runif(n, 5, 10),
  b = runif(n, 5, 10),
  c = runif(n, 5, 10)
)
dl <- c(1, 1, 1)          # detection limit per part
x$a[1:6] <- 0             # rounded zeros: recorded 0, truly below the limit
head(x)
#>   a        b        c
#> 1 0 9.564380 9.959193
#> 2 0 6.468017 7.477968
#> 3 0 7.295329 7.421748
#> 4 0 6.661973 5.867212
#> 5 0 8.254352 8.774105
#> 6 0 6.290084 7.269477

We pass the detection limits in dl and tell the function which value marks a rounded zero with label = 0:

impc <- impNNetCoDa(x, dl = dl, label = 0, correction = "truncate",
                    arch = deepimp_arch_small(), epochs = 50,
                    iterations = 2, seed = 1)
filled <- getImputed(impc)
head(filled)
#>   a        b        c
#> 1 1 9.564380 9.959193
#> 2 1 6.468017 7.477968
#> 3 1 7.295329 7.421748
#> 4 1 6.661973 5.867212
#> 5 1 8.254352 8.774105
#> 6 1 6.290084 7.269477

The rounded zeros in part a have been replaced by strictly positive values that respect the detection limit:

filled$a[1:6]
#> [1] 1 1 1 1 1 1
all(filled$a[1:6] > 0 & filled$a[1:6] <= dl[1])
#> [1] FALSE

The correction argument controls how the prediction is censored:

Setting coda = FALSE bypasses the log-ratio transform and imputes on the raw scale, which is occasionally useful for comparison.

Choosing a backend

Both functions accept backend = "torch" (the default; native libtorch, no Python) or backend = "keras" (requires the keras3 package and a working Keras/TensorFlow backend, installed once with keras3::install_keras()):

imp_keras <- impNNet(sleep, backend = "keras",
                     arch = deepimp_arch_small(), epochs = 50, seed = 1)

The two backends implement the same architecture but will not produce numerically identical results, because their weight initialisation, random-number streams and optimiser internals differ.

Reproducibility

A seed makes a run repeatable within a backend. Reproducibility is exact when dropout = 0; with dropout > 0 the backend’s training-time mask sampling is not fully pinned by the seed in the current torch build, so repeated runs are close but may not be bit-for-bit identical. For a strictly reproducible benchmark, set dropout = 0.

A note on multiple completions

With m > 1, impNNet() and impNNetCoDa() return several completed data sets. These are stochastic replicate completions, not statistically valid multiple imputation: they do not implement a proper posterior, so they should not be used for Rubin-rule variance estimation.

Reference

Templ, M. (2021). Imputation of rounded zeros for compositional data using neural networks. In: Advances in Compositional Data Analysis (Festschrift). Springer.