The pls() function offers some very basic approaches for handling missing values in the data, specified via the missing argument. Currently, there are three options.
missing = "listwise"
missing = "mean"
missing = "kNN"
The last two options are single imputation approaches. The pls() function does not currently offer multiple imputation, but the end of this vignette shows how users can perform it themselves with the mice package.
With missing = "listwise" (the default), any observation (i.e., a row) containing missing values for the variables used in the model is removed. Here is an example.
model <- "Survived ~ Age + Female + Age:Female"
fit <- pls(model, data = titanic, missing = "listwise", ordered = "Survived")
With missing = "mean", missing values are imputed with (univariate) expected values. For continuous variables, missing values are imputed with the mean. For ordinal variables with more than two categories, they are imputed with the median. For binary ordered variables, they are imputed with the mode.
In our example, missing values in Age are imputed with the mean of Age. Both Survived and Female are binary variables, so their missing values are imputed with the most common value.
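The rule described above can be sketched in base R. This is only an illustration of the idea, not the code that pls() runs internally: the helper names stat_mode() and impute_univariate() and the toy data frame are hypothetical, and taking the lower median of the category codes for ordered factors is an assumption.

```r
# Hypothetical helper: most frequent non-missing value of x.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: fill the NAs of one variable with a univariate summary.
# Assumption: any variable with at most two observed values counts as binary.
impute_univariate <- function(x) {
  if (!anyNA(x)) return(x)
  n_values <- length(unique(x[!is.na(x)]))
  if (n_values <= 2) {
    fill <- stat_mode(x)                       # binary: mode
  } else if (is.ordered(x)) {
    codes <- as.integer(x)
    # ordinal: median category (type = 1 returns an observed code)
    fill <- levels(x)[quantile(codes, 0.5, type = 1, na.rm = TRUE)]
  } else {
    fill <- mean(x, na.rm = TRUE)              # continuous: mean
  }
  x[is.na(x)] <- fill
  x
}

# Toy data, made up for illustration (not the titanic data).
toy <- data.frame(
  Age      = c(22, NA, 30, 40),
  Female   = c(1, 0, NA, 1),
  Survived = c(0, 1, 1, NA)
)
toy_imputed <- as.data.frame(lapply(toy, impute_univariate))
```

In this toy example the missing Age is replaced by the mean of the three observed ages, while the missing Female and Survived entries are replaced by their most common observed value.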
model <- "Survived ~ Age + Female + Age:Female"
fit <- pls(model, data = titanic, missing = "mean", ordered = "Survived")
With missing = "kNN", missing values are imputed by finding the k nearest (complete-data) neighbors of an observation with missing data. The values of the k neighbors are then aggregated using the mean, median, or mode, depending on the data type of the variable. The number of neighbors, k, can be specified using the knn.k argument.
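To make the idea concrete, here is a minimal base-R sketch of kNN imputation for a single numeric variable. It is not the algorithm that pls() implements: the helper knn_impute_numeric(), the choice of Euclidean distance on standardized predictors, and the toy data (including the Fare column) are all assumptions made for illustration.

```r
# Hypothetical helper: impute NAs in the numeric column `target` with the mean
# of the k nearest complete rows, where distance is computed on `predictors`.
knn_impute_numeric <- function(data, target, predictors, k = 5) {
  X <- scale(as.matrix(data[predictors]))       # standardize the predictors
  donors <- which(complete.cases(data[c(target, predictors)]))
  for (i in which(is.na(data[[target]]))) {
    if (anyNA(X[i, ])) next                     # cannot place this row; skip it
    diffs <- sweep(X[donors, , drop = FALSE], 2, X[i, ])
    d <- sqrt(rowSums(diffs^2))                 # Euclidean distance to donors
    nearest <- donors[order(d)[seq_len(min(k, length(donors)))]]
    data[[target]][i] <- mean(data[[target]][nearest])
  }
  data
}

# Toy data, made up for illustration (not the titanic data).
toy <- data.frame(
  Age    = c(20, 22, NA, 60, 62),
  Fare   = c(10, 11, 10.5, 80, 82),
  Female = c(1, 1, 1, 0, 0)
)
toy_knn <- knn_impute_numeric(toy, target = "Age",
                              predictors = c("Fare", "Female"), k = 2)
```

Here the third row is closest to the first two rows on Fare and Female, so its missing Age is filled with the mean of their ages. For a categorical variable, the same neighbor search would be followed by the median or mode instead of the mean.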
model <- "Survived ~ Age + Female + Age:Female"
fit <- pls(model, data = titanic, missing = "kNN",
           ordered = "Survived", knn.k = 5) # use the 5 nearest neighbors
Multiple imputation cannot be performed using the pls() function alone, but it can be performed with other multiple imputation packages available in R. Here we use the mice package, but other packages can be used as well (e.g., the Amelia package).
library(mice)
m <- 20 # Number of imputations
vars <- c("Survived", "Age", "Female") # Variables to impute/use in the analysis
imputations <- mice(titanic[vars], m = m)
COEF <- NULL # Matrix with estimated coefficients for each imputation
BOOT <- NULL # Matrix with all the bootstraps from all imputations
model <- "Survived ~ Age + Female + Age:Female"
for (i in seq_len(m)) {
  fit.i <- pls(model, data = complete(imputations, i), # get the ith imputation
               ordered = "Survived",
               bootstrap = TRUE,
               boot.R = 100,
               boot.parallel = "multicore", # Use parallel bootstrap
               boot.ncpus = 2L)
  COEF <- rbind(COEF, coef(fit.i))
  BOOT <- rbind(BOOT, boot(fit.i))
}
apply(COEF, MARGIN = 2, FUN = mean) # Mean estimate across imputations
apply(BOOT, MARGIN = 2, FUN = sd) # Standard errors
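To see what the last two lines compute without running pls() or mice, the pooling step can be mimicked on simulated numbers. Everything in this sketch is made up for illustration: the coefficient names, the "true" values, and the noise levels are assumptions, and the matrices only imitate the shapes of COEF (one row per imputation) and BOOT (boot.R rows per imputation, stacked).

```r
set.seed(1)
m <- 20         # number of imputations
R <- 100        # bootstrap replicates per imputation
true_coef <- c(Age = -0.02, Female = 1.5)    # hypothetical values

# One row of point estimates per imputation (between-imputation noise).
COEF <- t(replicate(m, true_coef + rnorm(2, sd = 0.05)))

# For each imputation, R bootstrap draws centered on that imputation's
# estimate (within-imputation noise), stacked row-wise.
BOOT <- do.call(rbind, lapply(seq_len(m), function(i) {
  draws <- matrix(rnorm(R * 2, sd = 0.10), nrow = R)
  sweep(draws, 2, COEF[i, ], FUN = "+")
}))
colnames(BOOT) <- names(true_coef)

pooled_est <- apply(COEF, MARGIN = 2, FUN = mean) # mean estimate across imputations
pooled_se  <- apply(BOOT, MARGIN = 2, FUN = sd)   # sd over all stacked bootstrap draws
```

Because the bootstrap draws from all imputations are stacked before the standard deviation is taken, the spread in this sketch reflects both the variability within each imputed data set and the differences between imputed data sets.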