SVEMnet Vignette

Andrew T. Karl

August 22, 2025

Version

1.5.3

Summary

SVEMnet implements Self-Validated Ensemble Models (SVEM; Lemkus et al. 2021) and the SVEM whole model test (Karl 2024) using Elastic Net regression via the glmnet package (Friedman et al. 2010). This vignette provides an overview of the package’s functionality and usage.

Preface - Note from the author

The motivation to create the SVEMnet package was primarily to have a personal sandbox to explore SVEM performance in different scenarios and with various modifications to its structure. As noted in the documentation, I used GPT o1-preview to help form the code structure of the package and to code the Roxygen structure of the documentation, and I have subsequently used more recent versions for auditing. The SVEM significance test R code comes from the supplementary material of Karl (2024). I wrote that code by hand and validated each step (not including the creation of the SVEM predictions) against corresponding results in JMP (the supplementary material of Karl (2024) provides the matching JSL script). For the SVEMnet() code, assuming only a single value of alpha is passed to glmnet, the heart of the SVEM loop is simply:

#partial code for illustration of the SVEM loop
#(assumes X, y_numeric, n, p, nBoot, alpha, standardize, and objective are defined by the caller)
coef_matrix <- matrix(NA, nrow = nBoot, ncol = p + 1)
for (i in 1:nBoot) {
  # fractionally weighted bootstrap: anti-correlated training and validation weights
  U <- runif(n)
  w_train <- -log(U)
  w_valid <- -log(1 - U)
  # match glmnet normalization of the weight vectors
  w_train <- w_train * (n / sum(w_train))
  w_valid <- w_valid * (n / sum(w_valid))
  fit <- glmnet(
    X, y_numeric,
    alpha = alpha,
    weights = w_train,
    intercept = TRUE,
    standardize = standardize,
    maxit = 1e6,
    nlambda = 500
  )
  pred_valid <- predict(fit, newx = X)
  # validation-weighted error at each lambda along the path
  val_errors <- colSums(w_valid * (y_numeric - pred_valid)^2)
  k_values <- fit$df
  n_obs <- length(y_numeric)
  aic_values <- n_obs * log(val_errors / n_obs) + 2 * k_values
  # choose lambda according to the selected objective
  if (objective == "wSSE") {
    idx_min <- which.min(val_errors)
    lambda_opt <- fit$lambda[idx_min]
    val_error <- val_errors[idx_min]
  } else if (objective == "wAIC") {
    idx_min <- which.min(aic_values)
    lambda_opt <- fit$lambda[idx_min]
    val_error <- aic_values[idx_min]
  }
  coef_matrix[i, ] <- as.vector(coef(fit, s = lambda_opt))
}

However, getting this to a stable implementation that includes error and warning handling and the structure needed for S3 methods such as predict(), coef(), and plot() was only practical with help from GPT o1-preview. I simply would not have taken the time to add that structure otherwise, and my implementation would have been inferior. I reviewed any code generated by this tool before integrating it and corrected its occasional mistakes. If someone would like to create a purely human-written set of code for a similar purpose, let me know and I will be happy to add links to your package and a description to the SVEMnet documentation.

Later revisions use more recent versions of GPT for code auditing, stress testing, and simulation. Many of the later entries in this vignette were written with GPT (code, analysis, summary).

SVEMnet Example 1

library(SVEMnet)

# Example data
data <- iris
svem_model <- SVEMnet(Sepal.Length ~ ., data = data, nBoot = 300)
coef(svem_model)
##                   Percent of Bootstraps Nonzero
## Sepal.Width                           100.00000
## Petal.Length                          100.00000
## Petal.Width                            94.33333
## Speciesvirginica                       93.66667
## Speciesversicolor                      89.66667

Generate a plot of actual versus predicted values:

plot(svem_model)

Predict outcomes for new data using the predict() function:

predictions <- predict(svem_model, data)
print(predictions)
##   [1] 5.006115 4.743725 4.774786 4.870098 5.058593 5.381136 4.925349 5.027532
##   [9] 4.691247 4.898387 5.184966 5.101427 4.772013 4.550328 5.120714 5.495735
##  [17] 5.085555 4.977827 5.356946 5.209156 5.175323 5.128389 4.763012 5.037980
##  [25] 5.323113 4.891516 5.044851 5.080010 4.953637 4.996472 4.943994 4.970956
##  [33] 5.423166 5.373460 4.870098 4.700891 4.932220 5.086881 4.669830 5.027532
##  [41] 4.903931 4.274196 4.774786 5.040752 5.476448 4.715437 5.311339 4.848681
##  [49] 5.184966 4.901159 6.468006 6.291927 6.535030 5.506726 6.155911 6.138592
##  [57] 6.463907 5.126802 6.264965 5.614454 5.064681 5.965285 5.539113 6.310572
##  [65] 5.526013 6.193842 6.186971 5.875398 5.767148 5.594363 6.428748 5.769116
##  [73] 6.220163 6.314671 6.043279 6.141364 6.331989 6.499871 6.134493 5.379548
##  [81] 5.467990 5.422383 5.671031 6.444621 6.186971 6.368595 6.387240 5.802307
##  [89] 5.947967 5.611682 5.988029 6.289155 5.692448 5.074324 5.864428 6.050150
##  [97] 5.969384 6.043279 4.929306 5.843011 6.963796 6.153600 6.845620 6.656321
## [105] 6.743436 7.362886 5.661849 7.173587 6.594199 7.195125 6.387702 6.301391
## [113] 6.550039 5.946461 6.064637 6.450627 6.634903 7.828317 7.318084 5.930468
## [121] 6.746208 6.029999 7.360114 6.034098 6.855263 7.109335 6.012681 6.191532
## [129] 6.518978 6.913165 6.945031 7.663207 6.490690 6.319231 6.612040 6.936834
## [137] 6.748981 6.687381 6.117636 6.528621 6.591547 6.250359 6.153600 6.893999
## [145] 6.742110 6.271776 5.974749 6.356641 6.629479 6.339322

Whole Model Significance Testing

This is the serial version of the significance test. It is slower, but its code is easier to read than the faster parallel version.

test_result <- svem_significance_test(Sepal.Length ~ ., data = data)
print(test_result)
plot(test_result)
SVEM Significance Test p-value:
[1] 0
Whole model test result

Note that there is a parallelized version that runs much faster:

test_result <- svem_significance_test_parallel(Sepal.Length ~ ., data = data)
print(test_result)
plot(test_result)
SVEM Significance Test p-value:
[1] 0

SVEMnet Example 2

# Simulate data
set.seed(1)
n <- 25
X1 <- runif(n)
X2 <- runif(n)
X3 <- runif(n)
X4 <- runif(n)
X5 <- runif(n)

#y only depends on X1 and X2
y <- 1 + X1 +  X2 + X1 * X2 + X1^2 + rnorm(n)
data <- data.frame(y, X1, X2, X3, X4, X5)

# Perform the SVEM significance test
test_result <- svem_significance_test_parallel(
  y ~ (X1 + X2 + X3)^2 + I(X1^2) + I(X2^2) + I(X3^2),
  data = data
)

# View the p-value
print(test_result)
SVEM Significance Test p-value:
[1] 0.009399093


test_result2 <- svem_significance_test_parallel(
  y ~ (X1 + X2)^2 + I(X1^2) + I(X2^2),
  data = data
)

# View the p-value
print(test_result2)
SVEM Significance Test p-value:
[1] 0.006475736

#note that the response does not depend on X4 or X5
test_result3 <- svem_significance_test_parallel(
  y ~ (X4 + X5)^2 + I(X4^2) + I(X5^2),
  data = data
)

# View the p-value
print(test_result3)
SVEM Significance Test p-value:
[1] 0.8968502

# Plot the Mahalanobis distances
plot(test_result,test_result2,test_result3)
Whole Model Test Results for Example 2

21DEC2024: Add cv.glmnet() wrapper

A wrapper for cv.glmnet(), glmnet_with_cv(), was added so that SVEM's performance can be compared against glmnet's native cross-validation implementation, as shown in the sketch below.
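A minimal sketch of a side-by-side fit, reusing the simulated data and formula from Example 2; the glmnet_with_cv() call signature and its $parms element follow the usage in the simulation script later in this section, and the printed values will vary run to run.

```r
# Sketch: fit SVEM and the CV wrapper on the Example 2 data and compare coefficients
svem_fit <- SVEMnet(y ~ (X1 + X2 + X3)^2 + I(X1^2) + I(X2^2) + I(X3^2),
                    data = data, nBoot = 300)
cv_fit   <- SVEMnet::glmnet_with_cv(y ~ (X1 + X2 + X3)^2 + I(X1^2) + I(X2^2) + I(X3^2),
                                    data = data, glmnet_alpha = c(1))

coef(svem_fit)  # SVEM: percent of bootstraps in which each coefficient is nonzero
cv_fit$parms    # CV wrapper: coefficients at the cv.glmnet-selected lambda
```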

22AUG2025: Simulation — choosing the default SVEMnet objective (lasso)

Including 0 (ridge) in glmnet_alpha seems to make the fits less stable, so the default is being changed to lasso (alpha = 1). It might also be reasonable to use glmnet_alpha = c(0.5, 1); see the sketch below.
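As a minimal sketch (again reusing the Example 2 data), the two settings discussed above would be requested as follows; glmnet_alpha is the same argument used in the simulation script later in this section.

```r
# Sketch: lasso-only default vs. a mixed elastic-net grid
fit_lasso <- SVEMnet(y ~ (X1 + X2)^2 + I(X1^2) + I(X2^2), data = data,
                     nBoot = 300, glmnet_alpha = c(1))
fit_mixed <- SVEMnet(y ~ (X1 + X2)^2 + I(X1^2) + I(X2^2), data = data,
                     nBoot = 300, glmnet_alpha = c(0.5, 1))
```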

What we tested. We compared the SVEMnet objectives wAIC, wBIC, and wSSE, each fit as lasso (α = 1), against a glmnet_cv_lasso benchmark.
Design range. Simulated mixture surface with random coefficients; theoretical R² ∈ {0.3, 0.5, 0.7, 0.9}; sample sizes n_total ∈ {15, 25, 35, …, 95}.
Metrics. Holdout NRASE and NAAE, both normalized per run by the holdout SD of the true response; tie-aware win rate; average rank (lower is better); and paired, per-run differences between the wAIC and wBIC fits. Analyses are stratified at n_total ≤ 40 vs n_total > 40.

Key findings from this run (values will vary with a new seed):

- Overall: glmnet_cv_lasso and SVEM_wAIC_lasso were close (mean NRASE ≈ 0.526 vs 0.532), wBIC was higher (0.543), and wSSE was highest (0.646).
- Small n (n_total ≤ 40): wBIC had the best mean NRASE (≈ 0.683), then glmnet (≈ 0.717), then wAIC (≈ 0.747); wSSE lagged (≈ 1.01). The AIC–BIC paired mean Δ was ≈ +0.063 (AIC worse; p ≈ 0.016). BIC is the safer choice at very small n.
- Larger n (n_total > 40): wAIC dominated (≈ 0.425) vs glmnet (≈ 0.430) and wBIC (≈ 0.473), with wSSE in between (≈ 0.462). The AIC–BIC paired mean Δ was ≈ −0.048 (AIC better; p ≪ 1e−30). AIC is superior once n grows.
- wSSE (lasso) under-performed both wAIC and wBIC in this setup and can be excluded from default consideration.

```r
# --- simulate_svemnet_lasso_with_wSSE.R --------------------------------------
# Adds SVEMnet objective "wSSE" (LASSO) to the previous script.
# Compares: SVEM_wAIC_lasso, SVEM_wBIC_lasso, SVEM_wSSE_lasso vs glmnet_cv_lasso.
# -----------------------------------------------------------------------------

# Packages
pkgs <- c("SVEMnet","glmnet","data.table","dplyr","tibble","tidyr","purrr","stringr")
for (p in pkgs) if (!requireNamespace(p, quietly = TRUE)) {
  install.packages(p, repos = "https://cloud.r-project.org")
}
invisible(lapply(pkgs, function(p) suppressPackageStartupMessages(library(p, character.only = TRUE))))

set.seed(1234)

# ------------------------------- Controls -------------------------------------
OUT_ITERS   <- 50                 # bump up when ready
N_TOTAL_SEQ <- seq(15, 95, by=10) # 15,25,35,45,...,95
HOLDOUT_N   <- 800
R2_LEVELS   <- c(0.3, 0.5, 0.7, 0.9)
SMALL_N_MAX <- 40                 # key threshold for the decision

# --------------------------- Helper generators --------------------------------
rmixture_bounded <- function(n) {
  out <- matrix(NA_real_, nrow = n, ncol = 4); colnames(out) <- c("A","B","C","D")
  i <- 1
  while (i <= n) {
    A <- runif(1, 0.1, 0.4)
    rest <- 1 - A
    u <- rexp(3); u <- u / sum(u) * rest
    B <- u[1]; C <- u[2]; D <- u[3]
    if (B <= 0.8 && C <= 0.8 && D <= 0.8) { out[i,] <- c(A,B,C,D); i <- i+1 }
  }
  as.data.frame(out)
}
space_fill_mixture <- function(n) rmixture_bounded(n)
sample_E <- function(n) sample(c("0","0.002"), n, replace = TRUE)

draw_pv <- function(nparm = 25) {
  lap1 <- function() rexp(1) - rexp(1)
  pv <- numeric(nparm)
  pv[1] <- lap1()
  for (j in 2:5)  pv[j] <- lap1() * rbinom(1,1,0.8)
  for (j in 6:nparm) pv[j] <- lap1() * rbinom(1,1,0.5)
  pv
}

true_y <- function(A,B,C,D,E, pv) {
  s  <- 0.9
  zA <- (A - 0.1)/s; zB <- B/s; zC <- C/s; zD <- D/s
  E_sign <- ifelse(E == "0",  1, -1)
  part1 <- pv[1]*zA + pv[2]*zB + pv[3]*zC + pv[4]*zD +
    E_sign*pv[5]*zA + E_sign*pv[6]*zB + E_sign*pv[7]*zC + E_sign*pv[8]*zD
  part2 <- 4 * ( pv[9]*zA*zB + pv[10]*zA*zC + pv[11]*zA*zD +
                   pv[12]*zB*zC + pv[13]*zB*zD + pv[14]*zC*zD )
  part3 <- 27 * ( pv[15]*zA*zB*zC + pv[16]*zA*zB*zD +
                    pv[17]*zA*zC*zD + pv[18]*zB*zC*zD )
  part4 <- 27 * ( pv[19]*zB*zA*(zA - zB) + pv[20]*zC*zA*(zA - zC) +
                    pv[21]*zC*zB*(zB - zC) + pv[22]*zD*zA*(zA - zD) +
                    pv[23]*zD*zB*(zB - zD) + pv[24]*zD*zC*(zC - zD) )
  part5 <- 256 * pv[25]*zA*zB*zC*zD
  part1 + part2 + part3 + part4 + part5
}

# Feature construction (consistent across train/holdout)
build_X <- function(df) {
  En <- ifelse(df$E == "0.002", 1, 0)
  with(df, {
    s  <- 0.9
    zA <- (A - 0.1)/s; zB <- B/s; zC <- C/s; zD <- D/s
    X <- cbind(
      A,B,C,D, En,
      A_En = A*En, B_En = B*En, C_En = C*En, D_En = D*En,
      A_B = A*B, A_C = A*C, A_D = A*D, B_C = B*C, B_D = B*D, C_D = C*D,
      A_B_C = A*B*C, A_B_D = A*B*D, A_C_D = A*C*D, B_C_D = B*C*D,
      A_B_C_D = A*B*C*D,
      uSC1 = zB*zA*(zA - zB), uSC2 = zC*zA*(zA - zC), uSC3 = zC*zB*(zB - zC),
      uSC4 = zD*zA*(zA - zD), uSC5 = zD*zB*(zB - zD), uSC6 = zD*zC*(zC - zD)
    )
    as.data.frame(X)
  })
}

# Metrics
compute_metrics <- function(y, yhat) {
  sse <- sum((y - yhat)^2); sst <- sum( (y - mean(y))^2 )
  rsq <- if (sst > 0) 1 - sse/sst else NA_real_
  rmse <- sqrt(mean((y - yhat)^2)); mae <- mean(abs(y - yhat))
  list(RSquare = rsq, RASE = rmse, AAE = mae)
}

# Prediction from named coefficient vector (handles (Intercept)/Intercept)
predict_from_parms <- function(parms, Xdf) {
  b <- parms
  b0 <- 0
  int_name <- intersect(names(b), c("(Intercept)","Intercept"))
  if (length(int_name)) { b0 <- as.numeric(b[int_name[1]]); b <- b[setdiff(names(b), int_name[1])] }
  cols_present <- intersect(names(b), colnames(Xdf))
  xb <- as.matrix(Xdf[, cols_present, drop = FALSE]) %*% as.numeric(b[cols_present])
  as.numeric(b0 + xb)
}

# Fitters (LASSO only)
fit_svemnet <- function(df, objective = c("wAIC","wBIC","wSSE")) {
  objective <- match.arg(objective)
  m <- SVEMnet::SVEMnet(Response ~ ., data = df,
                        glmnet_alpha = c(1), nBoot = 300, objective = objective)
  list(parms = m$parms,
       label = paste0("SVEM_", objective, "_lasso"))
}

fit_glmnet_cv <- function(df) {
  g <- SVEMnet::glmnet_with_cv(Response ~ ., data = df, glmnet_alpha = c(1))
  list(parms = g$parms, label = "glmnet_cv_lasso")
}

# ------------------------------- One run --------------------------------------
run_one <- function(run_id, n_total) {
  pv <- draw_pv(25)
  r2 <- sample(R2_LEVELS, 1)

  # Reference SD(TrueY) for noise calibration (analogous to your 15k sample)
  ref <- space_fill_mixture(15000); ref$E <- sample_E(nrow(ref))
  ref$TrueY <- with(ref, true_y(A,B,C,D,E, pv))
  y_sd_ref  <- sd(ref$TrueY)
  error_sd  <- y_sd_ref * sqrt((1 - r2) / r2)

  # Training design
  tr <- space_fill_mixture(n_total); tr$E <- sample_E(nrow(tr))
  tr$TrueY <- with(tr, true_y(A,B,C,D,E, pv))
  tr$Y     <- tr$TrueY + rnorm(nrow(tr), 0, error_sd)

  # Holdout design
  ho <- space_fill_mixture(HOLDOUT_N); ho$E <- sample_E(nrow(ho))
  ho$TrueY <- with(ho, true_y(A,B,C,D,E, pv))

  # Features
  X_tr <- build_X(tr); X_ho <- build_X(ho)
  df_tr <- data.frame(X_tr, Response = tr$Y)

  # Models to fit  (now includes wSSE)
  fits <- list(
    fit_svemnet(df_tr, "wAIC"),
    fit_svemnet(df_tr, "wBIC"),
    fit_svemnet(df_tr, "wSSE"),
    fit_glmnet_cv(df_tr)      # benchmark only
  )

  hold_sd_true <- sd(ho$TrueY)

  rows <- purrr::map_dfr(fits, function(f) {
    yhat_ho <- tryCatch(predict_from_parms(f$parms, X_ho), error = function(e) rep(NA_real_, nrow(X_ho)))
    yhat_tr <- tryCatch(predict_from_parms(f$parms, X_tr), error = function(e) rep(NA_real_, nrow(X_tr)))

    m_ho <- compute_metrics(ho$TrueY, yhat_ho)
    m_tr <- compute_metrics(tr$TrueY, yhat_tr)

    tibble::tibble(
      RunID = run_id,
      n_total = n_total,
      TheoreticalR2 = r2,
      Setting = f$label,
      # Holdout metrics (+ standardized)
      RSq_Holdout   = m_ho$RSquare,
      RASE_Holdout  = m_ho$RASE,
      AAE_Holdout   = m_ho$AAE,
      NRASE_Holdout = m_ho$RASE / hold_sd_true,
      NAAE_Holdout  = m_ho$AAE  / hold_sd_true,
      # Training (optional)
      RSq_Train  = m_tr$RSquare,
      RASE_Train = m_tr$RASE,
      AAE_Train  = m_tr$AAE
    )
  })
  rows
}

# ----------------------------- Run simulation ---------------------------------
all_rows <- list(); k <- 1
for (it in seq_len(OUT_ITERS)) {
  for (n_total in N_TOTAL_SEQ) {
    rid <- sprintf("run%05d", k)
    message("Simulating ", rid, " (n_total=", n_total, ")")
    all_rows[[length(all_rows)+1]] <- run_one(rid, n_total)
    k <- k + 1
  }
}
df <- data.table::rbindlist(all_rows)
df$Setting       <- factor(df$Setting,
                           levels = c("SVEM_wAIC_lasso","SVEM_wBIC_lasso","SVEM_wSSE_lasso","glmnet_cv_lasso"))
df$RunID         <- factor(df$RunID)
df$TheoreticalR2 <- factor(df$TheoreticalR2, levels = sort(unique(df$TheoreticalR2)))
df <- df %>% mutate(n_bucket = ifelse(n_total <= SMALL_N_MAX, "n<=40", "n>40"))

# ------------------------------ Analysis utils --------------------------------
summary_by <- function(dd) {
  dd %>%
    group_by(Setting) %>%
    summarise(
      runs       = n_distinct(RunID),
      mean_NRASE = mean(NRASE_Holdout, na.rm = TRUE),
      se_NRASE   = sd(NRASE_Holdout,   na.rm = TRUE) / sqrt(n_distinct(RunID)),
      mean_NAAE  = mean(NAAE_Holdout,  na.rm = TRUE),
      se_NAAE    = sd(NAAE_Holdout,    na.rm = TRUE) / sqrt(n_distinct(RunID)),
      .groups = "drop"
    ) %>% arrange(mean_NRASE)
}

win_rate_tbl <- function(dd) {
  winners <- dd %>%
    group_by(RunID) %>%
    slice_min(NRASE_Holdout, with_ties = TRUE) %>%
    ungroup() %>% mutate(flag = 1L)

  dd %>%
    select(RunID, Setting) %>%
    left_join(winners %>% select(RunID, Setting, flag), by = c("RunID","Setting")) %>%
    group_by(Setting) %>%
    summarise(
      wins     = sum(replace(flag, is.na(flag), 0L)),
      runs     = n_distinct(RunID),
      win_rate = wins / runs,
      .groups  = "drop"
    ) %>% arrange(desc(win_rate))
}

avg_rank_tbl <- function(dd) {
  dd %>%
    group_by(RunID) %>%
    mutate(rank = rank(NRASE_Holdout, ties.method = "average")) %>%
    ungroup() %>%
    group_by(Setting) %>%
    summarise(
      mean_rank = mean(rank, na.rm = TRUE),
      se_rank   = sd(rank,  na.rm = TRUE)/sqrt(n_distinct(RunID)),
      .groups = "drop"
    ) %>% arrange(mean_rank)
}

paired_compare <- function(dd, s1, s2, label="Overall") {
  wide <- dd %>%
    select(RunID, Setting, NRASE_Holdout) %>%
    tidyr::pivot_wider(names_from = Setting, values_from = NRASE_Holdout)
  if (!all(c(s1,s2) %in% colnames(wide))) return(invisible(NULL))
  d <- wide[[s1]] - wide[[s2]]
  d <- d[is.finite(d)]
  if (length(d) <= 2) return(invisible(NULL))
  tt <- t.test(d)
  ww <- suppressWarnings(wilcox.test(d, mu = 0, exact = FALSE))
  cat(sprintf("\n-- %s: %s - %s (N=%d) --\n", label, s1, s2, length(d)))
  cat(sprintf("  meanΔ = %+0.4f   t(%d) = %0.2f  p = %.3g   |   medianΔ = %+0.4f  Wilcoxon p = %.3g\n",
              mean(d), tt$parameter, tt$statistic, tt$p.value, median(d), ww$p.value))
}

pct_vs_baseline <- function(dd, baseline="glmnet_cv_lasso") {
  wide <- dd %>%
    select(RunID, Setting, NRASE_Holdout) %>%
    tidyr::pivot_wider(names_from = Setting, values_from = NRASE_Holdout) %>%
    tibble::as_tibble()
  others <- setdiff(colnames(wide), c("RunID", baseline))
  if (!baseline %in% colnames(wide)) return(tibble())
  out <- lapply(others, function(s) {
    d <- 100 * (wide[[s]] - wide[[baseline]]) / wide[[baseline]]
    tibble(Setting = s,
           mean_pct = mean(d, na.rm = TRUE),
           se_pct   = sd(d,   na.rm = TRUE)/sqrt(sum(is.finite(d))))
  }) %>% bind_rows() %>% arrange(mean_pct)
  out
}

# ------------------------------- Printouts ------------------------------------
cat("\n==================== OVERALL ====================\n")
overall <- df
print(summary_by(overall))
cat("\nWin rate (tie-aware):\n"); print(win_rate_tbl(overall))
cat("\nAverage rank:\n"); print(avg_rank_tbl(overall))

# Primary decision remains AIC vs BIC (wSSE will appear in tables above)
paired_compare(overall, "SVEM_wAIC_lasso", "SVEM_wBIC_lasso", "Overall (AIC vs BIC)")
cat("\nPercent Δ vs benchmark (glmnet_cv_lasso):\n"); print(pct_vs_baseline(overall))

cat("\n================== STRATIFIED ===================\n")
small <- df %>% filter(n_bucket == "n<=40")
large <- df %>% filter(n_bucket == "n>40")

cat("\n--- n_total <= 40 (primary interest) ---\n")
print(summary_by(small))
cat("\nWin rate:\n"); print(win_rate_tbl(small))
cat("\nAverage rank:\n"); print(avg_rank_tbl(small))
paired_compare(small, "SVEM_wAIC_lasso", "SVEM_wBIC_lasso", "n<=40 (AIC vs BIC)")
cat("\nPercent Δ vs benchmark:\n"); print(pct_vs_baseline(small))

cat("\n--- n_total > 40 ---\n")
print(summary_by(large))
cat("\nWin rate:\n"); print(win_rate_tbl(large))
cat("\nAverage rank:\n"); print(avg_rank_tbl(large))
paired_compare(large, "SVEM_wAIC_lasso", "SVEM_wBIC_lasso", "n>40 (AIC vs BIC)")
cat("\nPercent Δ vs benchmark:\n"); print(pct_vs_baseline(large))

cat("\n================= RECOMMENDATION =================\n")
cat("* Use results in 'n_total <= 40' block to pick SVEM default (focus: AIC vs BIC).\n")
cat("* Benchmark (glmnet_cv_lasso) and wSSE are for additional context.\n")
cat("\nDone.\n")

References and Citations

  1. Lemkus, T., Gotwalt, C., Ramsey, P., & Weese, M. L. (2021). Self-Validated Ensemble Models for Elastic Net Regression.
    Chemometrics and Intelligent Laboratory Systems, 219, 104439.
    DOI: 10.1016/j.chemolab.2021.104439

  2. Karl, A. T. (2024). A Randomized Permutation Whole-Model Test for SVEM.
    Chemometrics and Intelligent Laboratory Systems, 249, 105122.
    DOI: 10.1016/j.chemolab.2024.105122

  3. Friedman, J. H., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent.
    Journal of Statistical Software, 33(1), 1–22.
    DOI: 10.18637/jss.v033.i01

  4. Gotwalt, C., & Ramsey, P. (2018). Model Validation Strategies for Designed Experiments Using Bootstrapping Techniques With Applications to Biopharmaceuticals.
    JMP Discovery Conference.

  5. Ramsey, P., Gaudard, M., & Levin, W. (2021). Accelerating Innovation with Space-Filling Mixture Designs, Neural Networks, and SVEM.
    JMP Discovery Conference.

  6. Ramsey, P., & Gotwalt, C. (2018). Model Validation Strategies for Designed Experiments Using Bootstrapping Techniques With Applications to Biopharmaceuticals.
    JMP Discovery Summit Europe.

  7. Ramsey, P., Levin, W., Lemkus, T., & Gotwalt, C. (2021). SVEM: A Paradigm Shift in Design and Analysis of Experiments.
    JMP Discovery Summit Europe.

  8. Ramsey, P., & McNeill, P. (2023). CMC, SVEM, Neural Networks, DOE, and Complexity: It’s All About Prediction.
    JMP Discovery Conference.

  9. Karl, A., Wisnowski, J., & Rushing, H. (2022). JMP Pro 17 Remedies for Practical Struggles with Mixture Experiments.
    JMP Discovery Conference.

  10. Xu, L., Gotwalt, C., Hong, Y., King, C. B., & Meeker, W. Q. (2020). Applications of the Fractional-Random-Weight Bootstrap.
    The American Statistician, 74(4), 345–358.

  11. Karl, A. T. (2024). SVEMnet: Self-Validated Ensemble Models with Elastic Net Regression.
    R package

  12. JMP Help Documentation: Overview of Self-Validated Ensemble Models.