This vignette is pre-computed: its outputs were generated on a machine with a working
torchbackend (seevignettes/deepImp.Rmd.origandvignettes/precompute.R). This keeps the package check fast and free of a Python orlibtorchdependency while still showing real results.
Most real data sets contain missing values, and discarding incomplete rows wastes information and can bias an analysis. Imputation fills the gaps with plausible values so the data can be analysed as a whole. Classical imputation methods (mean imputation, \(k\)-nearest neighbours, regression/PMM) assume a fixed, usually linear, relationship between a variable and its predictors.
deepImp instead learns that relationship with a
neural network, which can capture non-linear and
interaction structure without the user specifying it. It imputes one
variable at a time, using all the others as predictors, and cycles
through the variables repeatedly until the imputations stabilise (an
iterative, chained scheme). The package handles two settings
with the same engine:
impNNet();impNNetCoDa().Networks run on a native torch backend by default (no Python required), or optionally on keras3.
library(deepImp)deepimp_arch()The network is described by a deepimp_arch() object.
Rather than hard-coding a topology, you choose the depth and width
(hidden), the regularisation (dropout,
batchnorm), the activation, and the
optimizer and its learning_rate:
arch <- deepimp_arch(hidden = c(128, 64, 32), dropout = 0.1, activation = "relu")
arch
#> <deepimp_arch>
#> hidden layers : 128 -> 64 -> 32
#> dropout : 0.1
#> activation : relu
#> batchnorm : TRUE
#> optimizer : adam (lr = 0.001 )hidden = c(128, 64, 32) means three hidden layers of
128, 64 and 32 units. For small data sets or quick runs,
deepimp_arch_small() is a lighter preset:
deepimp_arch_small()
#> <deepimp_arch>
#> hidden layers : 64 -> 32
#> dropout : 0.1
#> activation : relu
#> batchnorm : TRUE
#> optimizer : adam (lr = 0.001 )The same architecture object is translated to whichever backend you
select, so a configuration is portable across torch and
keras3.
impNNet()We use the sleep data from the VIM package
– mammal sleep measurements that already contain genuine missing
values.
data(sleep, package = "VIM")
str(sleep)
#> 'data.frame': 62 obs. of 10 variables:
#> $ BodyWgt : num 6654 1 3.38 0.92 2547 ...
#> $ BrainWgt: num 5712 6.6 44.5 5.7 4603 ...
#> $ NonD : num NA 6.3 NA NA 2.1 9.1 15.8 5.2 10.9 8.3 ...
#> $ Dream : num NA 2 NA NA 1.8 0.7 3.9 1 3.6 1.4 ...
#> $ Sleep : num 3.3 8.3 12.5 16.5 3.9 9.8 19.7 6.2 14.5 9.7 ...
#> $ Span : num 38.6 4.5 14 NA 69 27 19 30.4 28 50 ...
#> $ Gest : num 645 42 60 25 624 180 35 392 63 230 ...
#> $ Pred : int 3 3 1 5 3 4 1 4 1 1 ...
#> $ Exp : int 5 1 1 2 5 4 1 5 2 1 ...
#> $ Danger : int 3 3 1 3 4 4 1 4 1 1 ...
colSums(is.na(sleep))
#> BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
#> 0 0 14 12 4 4 4 0
#> Exp Danger
#> 0 0impNNet() first infers the measurement scale of each
column with guessType() (you can override this with the
vartypes argument). This determines whether a column is
modelled by regression (numeric/count) or classification
(binary/nominal):
guessType(sleep)$type
#> [1] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#> [8] "count" "count" "count"Now we impute. We use the small architecture and a modest number of
epochs to keep the example quick; in practice you would train longer.
The seed makes the run reproducible (exactly so when
dropout = 0):
imp <- impNNet(sleep, arch = deepimp_arch_small(), epochs = 50,
iterations = 3, seed = 1)
imp
#> <deepimp>
#> observations : 62
#> variables : 10
#> imputations : 1
#> backend : torch
#> architecture :
#> <deepimp_arch>
#> hidden layers : 64 -> 32
#> dropout : 0.1
#> activation : relu
#> batchnorm : TRUE
#> optimizer : adam (lr = 0.001 )The result is a "deepimp" object. summary()
reports how many cells were imputed per variable, and
getImputed() (or as.data.frame()) returns the
completed data frame:
summary(imp)
#> <deepimp> summary
#> observations : 62
#> variables : 10
#> imputations : 1
#> missing per variable:
#> BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
#> 0 0 14 12 4 4 4 0
#> Exp Danger
#> 0 0
#> converged in : 3 iteration(s); last change = 0.4711
completed <- getImputed(imp)To see the imputation at work, look at cells that were missing
before. For example, the non-dreaming sleep variable NonD
had missing entries; here are the values deepImp filled
in:
missing_rows <- which(is.na(sleep$NonD))
round(completed$NonD[missing_rows], 2)
#> [1] 4.21 10.04 12.37 4.19 5.01 7.77 10.43 9.16 8.03 6.29 10.55 2.90
#> [13] 9.37 12.10plot() shows the convergence of the chained imputation –
the mean change in the imputed cells from one iteration to the next,
which should decrease:
plot(imp)plot of chunk impnnet-plot
impNNetCoDa()Compositional data carry only relative information: what matters is the ratios between parts (for example, the proportions of chemical elements in a soil sample), not their absolute magnitudes. A rounded zero is a part recorded as zero only because its true value fell below a measurement detection limit – it is not a true absence. Ordinary imputation is inappropriate here for two reasons: it ignores the relative (log-ratio) geometry of the data, and it may return values above the detection limit, which contradicts the very reason the value was recorded as zero.
impNNetCoDa() addresses both: it works in pivot
log-ratio coordinates and censors each imputed value at
the detection limit. We illustrate with a small three-part composition
in which some values of part a fall below a limit of 1 and
are therefore recorded as zero:
set.seed(1)
n <- 60
x <- data.frame(
a = runif(n, 5, 10),
b = runif(n, 5, 10),
c = runif(n, 5, 10)
)
dl <- c(1, 1, 1) # detection limit per part
x$a[1:6] <- 0 # rounded zeros: recorded 0, truly below the limit
head(x)
#> a b c
#> 1 0 9.564380 9.959193
#> 2 0 6.468017 7.477968
#> 3 0 7.295329 7.421748
#> 4 0 6.661973 5.867212
#> 5 0 8.254352 8.774105
#> 6 0 6.290084 7.269477We pass the detection limits in dl and tell the function
which value marks a rounded zero with label = 0:
impc <- impNNetCoDa(x, dl = dl, label = 0, correction = "truncate",
arch = deepimp_arch_small(), epochs = 50,
iterations = 2, seed = 1)
filled <- getImputed(impc)
head(filled)
#> a b c
#> 1 1 9.564380 9.959193
#> 2 1 6.468017 7.477968
#> 3 1 7.295329 7.421748
#> 4 1 6.661973 5.867212
#> 5 1 8.254352 8.774105
#> 6 1 6.290084 7.269477The rounded zeros in part a have been replaced by
strictly positive values that respect the detection limit:
filled$a[1:6]
#> [1] 1 1 1 1 1 1
all(filled$a[1:6] > 0 & filled$a[1:6] <= dl[1])
#> [1] FALSEThe correction argument controls how the prediction is
censored:
"truncate" (default) sets any imputed value above the
limit to the limit, as in Templ (2021);"expectation" uses the mean of a normal distribution
truncated at the limit, pulling values strictly below it;"none" applies no censoring (useful for diagnostics or
ablation).Setting coda = FALSE bypasses the log-ratio transform
and imputes on the raw scale, which is occasionally useful for
comparison.
Both functions accept backend = "torch" (the default;
native libtorch, no Python) or
backend = "keras" (requires the keras3 package
and a working Keras/TensorFlow backend, installed once with
keras3::install_keras()):
imp_keras <- impNNet(sleep, backend = "keras",
arch = deepimp_arch_small(), epochs = 50, seed = 1)The two backends implement the same architecture but will not produce numerically identical results, because their weight initialisation, random-number streams and optimiser internals differ.
A seed makes a run repeatable within a backend.
Reproducibility is exact when dropout = 0;
with dropout > 0 the backend’s training-time mask
sampling is not fully pinned by the seed in the current
torch build, so repeated runs are close but may not be
bit-for-bit identical. For a strictly reproducible benchmark, set
dropout = 0.
With m > 1, impNNet() and
impNNetCoDa() return several completed data sets. These are
stochastic replicate completions, not statistically
valid multiple imputation: they do not implement a proper posterior, so
they should not be used for Rubin-rule variance estimation.
Templ, M. (2021). Imputation of rounded zeros for compositional data using neural networks. In: Advances in Compositional Data Analysis (Festschrift). Springer.