PosiR: Post-Selection Inference via Simultaneous Confidence Intervals

Statistical inference after model selection is challenging because standard methods often ignore the selection process, leading to biased results like confidence intervals with lower-than-nominal coverage. This issue, known as post-selection inference (PoSI), is addressed by the PosiR package, which implements the methodology from Kuchibhotla, Kolassa, and Kuffner (2022). PosiR constructs confidence intervals that are simultaneously valid across a pre-defined universe of models, ensuring the desired coverage probability (e.g., 95%) regardless of the model selection process.

This vignette introduces Post-Selection Inference concepts, lists the package functions, explains the methodology, and provides examples.

Package Functions

The PosiR package provides two core functions:

simultaneous_ci()
Computes simultaneous confidence intervals for linear regression coefficients across a user-specified universe of models.

Key Arguments:

X: Design matrix (n × p, no intercept).
y: Response vector (length n).
Q_universe: Named list of model indices (including intercept if add_intercept = TRUE).
B: Number of bootstrap samples (default: 1000).
alpha: Significance level (default: 0.05).
verbose: Logical; if TRUE, shows progress messages.

Returns:

intervals: Data frame with columns model_id, coefficient_name, estimate, lower, upper, psi_hat_nqj, se_nqj.
K_alpha: Critical value for simultaneous coverage.
Other diagnostic outputs (e.g., T_star_b, ``warnings).

plot.simultaneous_ci_result()

Visualizes results from simultaneous_ci().

Key Arguments:

x: A simultaneous_ci_result object.
subset_pars: Vector of coefficient names to plot.
las.labels: Label orientation (e.g., 2 for perpendicular).
col.ci: Color of the confidence intervals.
main: Plot title.

Theoretical Foundation

PosiR constructs confidence intervals for coefficients \(\theta_{q,j}\) (where \(q\) denotes a model in the universe 𝒬 and \(j\) denotes a specific coefficient within that model) that are simultaneously valid.

\[ P(\theta_{q,j} \in \widehat{\text{CI}}_{q,j} \text{ for all } q \in 𝒬, j \text{ in model } q) \ge 1-\alpha \]

The interval for \(\theta_{q,j}\) is: \[ \widehat{\text{CI}}_{q,j} = \left[ \widehat{\theta}_{q,j} \pm \widehat{K}_{\alpha} \frac{\widehat{\Psi}_{n,q,j}^{1/2}}{\sqrt{n}} \right] \] where:

\(\widehat{\theta}_{q,j}\) is the estimate of the coefficient (e.g., from OLS) in model \(q\) using the original data.
\(\widehat{\Psi}_{n,q,j}\) is an estimate of the asymptotic variance of \(\sqrt{n}(\widehat{\theta}_{q,j} - \theta_{q,j})\).
\(\widehat{K}_{\alpha}\) is a critical value (quantile) obtained from a bootstrap procedure, designed to ensure simultaneous coverage.

The bootstrap procedure (Algorithm 1 from Kuchibhotla, Kolassa, and Kuffner (2022)) is:

Original Estimate: Compute \(\widehat{\theta}_{q,j}\) for all relevant \(j\) in all models \(q \in 𝒬\) using the original data.
Bootstrap Sampling: For \(b = 1, \ldots, B\), resample pairs \((X_i, y_i)\) with replacement from the original data to compute the bootstrap estimates \(\widehat{\theta}_{q,j}^{*b}\) for all \(j, q\).
Variance Estimation:

\[ \widehat{\Psi}_{n,q,j} := \frac{1}{B_{valid}-1} \sum_{b \in \text{valid}} \left[ \sqrt{n} \left( \widehat{\theta}_{q,j}^{*b} - \widehat{\theta}_{q,j} \right) \right]^2 \]
Bootstrap Max-t Statistics (\(T^{*b}\)) \[ T^{*b} := \max_{q \in 𝒬} \max_{j \in \text{coeffs}(q)} \left| n^{1/2} \widehat{\Psi}_{n,q,j}^{-1/2} \left( \widehat{\theta}_{q,j}^{*b} - \widehat{\theta}_{q,j} \right) \right| \]
Critical Value: Compute \(\widehat{K}_{\alpha}\) as the \((1-\alpha)\) quantile of the collected bootstrap max-t statistics \(\{ T^{*1}, \ldots, T^{*B} \}\).
Confidence Intervals: \[ \widehat{\theta}_{q,j} \pm \widehat{K}_{\alpha} \sqrt{\widehat{\Psi}_{n,q,j}/n} \]

By using \(\widehat{K}_{\alpha}\) derived from the maximum statistic across the model universe, the resulting intervals achieve the desired simultaneous coverage probability: \[ P(\text{all } \theta_{q,j} \in \widehat{\text{CI}}_{q,j}) \ge 1-\alpha \]

Installation

First, ensure devtools is installed:

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("Chukyhenry/PosiR")

Optional dependencies:

pbapply: For progress bars during computation. glmnet: For the Lasso example (Example 3).

install.packages("pbapply")
install.packages("glmnet")

Load the package

if (!requireNamespace("dplyr", quietly = TRUE)) {
  stop("Please install the 'dplyr' package to run this vignette.")
}
library(PosiR)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stats)

Example 1: Simulated Data

We simulate data where Age affects the response \(\beta_{\text{Age}} = 0.5\).

set.seed(123) # for reproducibility
X <- matrix(rnorm(200 * 2), ncol = 2, dimnames = list(NULL, c("Age", "Income")))
y <- 0.5 * X[, "Age"] + rnorm(200)  # the true intercept is 0, the true coefficient for Age is 0.5, and the true coefficient for Income is 0.

Define the model universe:

We specify a universe \(𝒬\) containing two potential models: one with only Age and one with both Age and Income. The indices refer to the columns in X. Note that simultaneous_ci adds an intercept by default (add_intercept = TRUE), which will also be included in the inference.

Q <- list(
  Age_only = 1,              # Model with Intercept + Age
  Age_Income = c(1, 2)       # Model with Intercept + Age + Income
)
# Indices refer to columns of X: 1="Age", 2="Income"

Run simultaneous_ci():

We call the function, providing the data X, y, and the model universe Q. We use B=1000 bootstrap samples for reasonably stable results, cores=2 to speed up computation via parallel processing (adjust based on your machine), and seed for reproducibility. verbose=FALSE suppresses progress messages.

result <- simultaneous_ci(X, y, Q, B = 1000, cores = 2, verbose = FALSE)

The output result is a list. Key components include:

intervals: A data frame with estimates and simultaneous CIs.
K_alpha: The calculated critical value \(\widehat{K}_{\alpha}\).

# Show the calculated critical value
print(paste("Calculated K_alpha:", round(result$K_alpha, 3)))
#> [1] "Calculated K_alpha: 2.347"

# Display the intervals data frame
print(result$intervals)
#>     model_id coefficient_name   estimate      lower     upper psi_hat_nqj
#> 1 Age_Income      (Intercept) 0.03151043 -0.1307487 0.1937696   0.9561812
#> 2 Age_Income              Age 0.46907362  0.2712020 0.6669452   1.4219643
#> 3   Age_only      (Intercept) 0.02749026 -0.1483899 0.2033704   1.1234542
#>       se_nqj
#> 1 0.06914410
#> 2 0.08431976
#> 3 0.07494846

The columns of the resulting intervals data frame are:

model_id: Name of the model from Q.
coefficient_name: Name of the coefficient (including intercept).
estimate: Point estimate \(\widehat{\theta}_{q,j}\).
lower, upper: Bounds of the \((1-\alpha)\)% simultaneous confidence interval.
psi_hat_nqj: Bootstrap variance estimate \(\widehat{\Psi}_{n,q,j}\).
se_nqj: Standard error estimate \(\sqrt{\widehat{\Psi}_{n,q,j}/n}\).

Visualize Intervals

We use the S3 plot method for simultaneous_ci_result objects. las.labels=2 rotates labels for readability.

plot(result, las.labels = 2, col.ci = "darkblue", main = "Simultaneous Confidence Intervals")

#### Interpretation

6. Interpretation

Age Coefficient: The point estimates for Age in both models are close to the true value of 0.5. Importantly, the simultaneous 95% confidence intervals (e.g., approx. [0.26, 0.67] for the Age_Income model) comfortably contain the true value 0.5 and exclude 0, correctly indicating a significant effect.

Income Coefficient: The point estimate for Income (only present in the Age_Income model) is close to 0. The simultaneous CI (approx. [-0.19, 0.17]) contains the true value 0, correctly suggesting no significant effect.

Intercept: The point estimates for the (Intercept) are close to 0. The simultaneous CIs contain the true value 0, indicating no significant intercept term.

Simultaneous Validity: These intervals are valid simultaneously. We can be 95% confident that all these reported intervals contain their respective true parameter values, regardless of whether we decided to focus on the “Age_only” or “Age_Income” model after looking at the data.

Example 2: `mtcars` Dataset

Now, let’s apply the method to a real dataset, mtcars, to predict miles per gallon (mpg).

Prepare data

We select hp (horsepower), wt (weight), and qsec (1/4 mile time) as potential predictors for mpg.

data(mtcars)
X_mtcars <- as.matrix(mtcars[, c("hp", "wt", "qsec")])
y_mtcars <- mtcars$mpg

Define models

We consider models involving each predictor individually and a full model with all three. The indices refer to columns in X_mtcars. The intercept is added automatically by simultaneous_ci.

Q_mtcars <- list(
  HP_only = 1,         # Model: mpg ~ Intercept + hp
  WT_only = 2,         # Model: mpg ~ Intercept + wt
  QSEC_only = 3,       # Model: mpg ~ Intercept + qsec
  Full_Model = 1:3     # Model: mpg ~ Intercept + hp + wt + qsec
)
# Indices: 1="hp", 2="wt", 3="qsec"

Run `simultaneous_ci()`

result_mtcars <- simultaneous_ci(X = X_mtcars, y = y_mtcars, Q_universe = Q_mtcars,
                                 B = 2000, cores = 2, seed = 123, verbose = FALSE)

# Display the critical value
print(paste("Calculated K_alpha:", round(result_mtcars$K_alpha, 3)))
#> [1] "Calculated K_alpha: 2.649"

Compare simultaneous vs. naive 95% intervals:

We calculate the standard “naive” 95% confidence intervals using the normal approximation \(z_{\alpha/2}\) quantile and the same standard error estimate \(\sqrt{\widehat{\Psi}_{n,q,j}/n}\) derived from the bootstrap. The naive interval is: \[ \widehat{\text{CI}}_{q,j}^{\text{unadj}} = \left[ \widehat{\theta}_{q,j} \pm z_{\alpha/2} \frac{\widehat{\Psi}_{n,q,j}^{1/2}}{\sqrt{n}} \right] \] where \(z_{\alpha/2} \approx 1.96\) for \(\alpha = 0.05\).

z_alpha_2 <- stats::qnorm(0.975) 

intervals_comparison <- result_mtcars$intervals %>%
  mutate(
    naive_lower = estimate - z_alpha_2 * se_nqj,
    naive_upper = estimate + z_alpha_2 * se_nqj
  ) %>%
  select(model_id, coefficient_name, estimate, lower, upper, naive_lower, naive_upper) %>%
  arrange(model_id, coefficient_name) # Ensure consistent ordering

print(intervals_comparison, digits = 3)
#>     model_id coefficient_name estimate   lower   upper naive_lower naive_upper
#> 1 Full_Model      (Intercept)  37.2273 31.5341 42.9204     33.0148     41.4397
#> 2 Full_Model               hp  -0.0318 -0.0524 -0.0112     -0.0470     -0.0165
#> 3 Full_Model               wt  -3.8778 -5.7458 -2.0099     -5.2599     -2.4957
#> 4    HP_only      (Intercept)  20.0906 17.2695 22.9118     18.0032     22.1780
#> 5  QSEC_only               wt   5.2916  3.8172  6.7660      4.2007      6.3826
#> 6    WT_only               hp   0.1011  0.0666  0.1356      0.0756      0.1266

Visualize intervals

plot(result_mtcars, las.labels = 2, main = "mtcars Data: Simultaneous CIs")

Interpretation

Interval Width: As expected, the simultaneous intervals (lower, upper) are uniformly wider than the corresponding naive intervals (naive_lower, naive_upper). This widening is caused by the multiplier \(\widehat{K}_{\alpha}\) being larger than \(z_{\alpha/2} \approx 1.96\).

Significance: For these mtcars models, the conclusions about which coefficients are statistically significant (i.e., interval excludes zero) do not change between the naive and simultaneous intervals. For example, hp and wt generally show significant negative associations with mpg, while qsec and the intercept terms show less significance.

PoSI Correction: The key takeaway is that the simultaneous intervals provide valid inference after considering multiple models. If we had used the naive intervals and selected, for instance, the Full_Model because it looked “best” based on some criterion, the naive intervals for that model’s coefficients would not strictly be 95% confidence intervals due to the selection bias. The simultaneous intervals, however, maintain their 95% coverage guarantee across the entire set Q_mtcars. This correction ensures our conclusions are not overly optimistic.

Example 3: High-Dimensional Data with Lasso Selection

A common workflow involves using a variable selection method like Lasso to choose a smaller model from many potential predictors, followed by inference on the selected model. PosiR can provide valid post-selection inference in this context by defining a universe \(𝒬\) that includes at least the Lasso-selected model.

Simulate Data with Interaction

We simulate data with \(100\) observations and \(50\) predictors, where only the first 5 predictors (\(X_1, \ldots, X_5\)) have non-zero coefficients.

set.seed(123)
n_lasso <- 100
p_lasso <- 50
X_lasso <- matrix(rnorm(n_lasso * p_lasso), n_lasso, p_lasso,
                  dimnames = list(NULL, paste0("X", 1:p_lasso)))

# True coefficients: 1 for first 5, 0 for rest
beta_true <- c(rep(1.0, 5), rep(0, p_lasso - 5))
true_intercept <- 0.5

# Generate response (linear model)
y_lasso <- drop(true_intercept + X_lasso %*% beta_true + rnorm(n_lasso))

Perform Lasso Selection (if glmnet is available)

We use glmnet to perform Lasso regression and identify variables selected using cross-validation (lambda.min).

# This chunk only runs if glmnet is installed
if (requireNamespace("glmnet", quietly = TRUE)) {
  library(glmnet)
  # Code using glmnet
} else {
  message("glmnet is not available. Skipping this example.")
}
#> Loading required package: Matrix
#> Loaded glmnet 4.1-8
library(glmnet)

# Use cv.glmnet to find optimal lambda
cv_lasso_fit <- cv.glmnet(X_lasso, y_lasso, alpha = 1) # alpha=1 for Lasso

# Get coefficients at lambda.min, excluding the intercept
lasso_coeffs <- coef(cv_lasso_fit, s = "lambda.min")[-1, 1]

# Identify indices of selected variables (non-zero coefficients)
# Indices are relative to the columns of X_lasso (1 to p_lasso)
select_var_index <- which(lasso_coeffs != 0)

# Get names of selected variables
select_var_names <- names(select_var_index)

cat("Lasso selected variables (indices in X_lasso):", paste(select_var_index, collapse = ", "), "\n")
#> Lasso selected variables (indices in X_lasso): 1, 2, 3, 4, 5, 9, 13, 18, 22, 24, 26, 29, 33, 36, 41
cat("Lasso selected variables (names):", paste(select_var_names, collapse = ", "), "\n")
#> Lasso selected variables (names): X1, X2, X3, X4, X5, X9, X13, X18, X22, X24, X26, X29, X33, X36, X41

# If no variables selected (unlikely here), handle gracefully for Q definition
if (length(select_var_index) == 0) {
    warning("Lasso selected no variables. Defining Q with intercept-only and full model.")
    select_var_index1 <- 1 # Just intercept index for design matrix
} else {
    # IMPORTANT: Need indices relative to the design matrix used in simultaneous_ci
    # which includes intercept as column 1. Add 1 to Lasso indices.
    select_var_index1 <- c(1, select_var_index + 1)
}

# Fallback if glmnet is not installed
if (!requireNamespace("glmnet", quietly = TRUE)) {
  warning("glmnet package not found. Using arbitrary selection for vignette.", call. = FALSE)
  select_var_index <- c(1, 2, 3, 4, 5, 9, 13, 18, 22, 24, 26, 29, 33, 36, 41)
  select_var_names <- paste0("X", select_var_index)
  select_var_index1 <- c(1, select_var_index + 1)
}

Define model universe

Our universe \(𝒬\) for post-selection inference must include the model selected by Lasso. We can also include other models, like the full model, for comparison or robustness.

# Indices for X in simultaneous_ci context (assuming add_intercept=TRUE)
# Intercept is 1, X1 is 2, ..., Xp is p+1
full_model_indices <- 1:(p_lasso + 1)

Q_lasso <- list(
  # Model selected by Lasso (Intercept + selected X's)
  LassoSelected = select_var_index1,
  # Full model (Intercept + All X's) - optional, but common for context
  FullModel = full_model_indices
)

# Check if Lasso selection was empty
if (length(select_var_index1) == 1 && select_var_index1 == 1) {
    Q_lasso$LassoSelected <- 1 # Ensure it's just the intercept index
}

print("Models in Q_lasso (indices refer to design matrix with intercept):")
#> [1] "Models in Q_lasso (indices refer to design matrix with intercept):"
# Print only first few indices for brevity if models are large
print(lapply(Q_lasso, function(idx) if(length(idx)>10) c(idx[1:5], "...", idx[length(idx)]) else idx))
#> $LassoSelected
#>          X1    X2    X3    X4         X41 
#>   "1"   "2"   "3"   "4"   "5" "..."  "42" 
#> 
#> $FullModel
#> [1] "1"   "2"   "3"   "4"   "5"   "..." "51"

`Run simultaneous_ci()`

We now run PosiR using the original data (X_lasso, y_lasso) and the universe Q_lasso containing the data-dependent Lasso model.

# Note: We pass the original X matrix (without intercept) to simultaneous_ci
# It will add the intercept based on add_intercept=TRUE
result_lasso <- simultaneous_ci(
  X_lasso, y_lasso, Q_lasso,
  B = 2000, # Recommend B=2000+ for high-dim
  cores = 2, seed = 123, verbose = FALSE
)

Compare Intervals (Focus on Selected Model)

intervals_lasso_comp <- result_lasso$intervals %>%
  filter(model_id == "LassoSelected") %>% # Focus on the Lasso model
  mutate(
    naive_lower = estimate - qnorm(0.975) * se_nqj,
    naive_upper = estimate + qnorm(0.975) * se_nqj
  ) %>%
  select(coefficient_name, estimate, lower, upper, naive_lower, naive_upper) %>%
  arrange(match(coefficient_name, c("(Intercept)", paste0("X", 1:p_lasso)))) # Order numerically

print(intervals_lasso_comp, digits = 3)
#>    coefficient_name estimate    lower upper naive_lower naive_upper
#> 1       (Intercept)   0.5416  0.20017 0.883     0.36796      0.7153
#> 2                X1   0.9414  0.57313 1.310     0.75409      1.1287
#> 3                X2   1.0042  0.67599 1.332     0.83728      1.1712
#> 4                X3   1.1591  0.80776 1.510     0.98038      1.3378
#> 5                X4   0.9961  0.67931 1.313     0.83498      1.1573
#> 6                X5   0.9351  0.53710 1.333     0.73264      1.1375
#> 7                X9  -0.1018 -0.39837 0.195    -0.25264      0.0491
#> 8               X13   0.1298 -0.22662 0.486    -0.05150      0.3111
#> 9               X18   0.3062  0.00275 0.610     0.15186      0.4606
#> 10              X22  -0.0728 -0.46496 0.319    -0.27228      0.1266
#> 11              X24   0.2759 -0.02264 0.574     0.12406      0.4278
#> 12              X26   0.1806 -0.19081 0.552    -0.00831      0.3695
#> 13              X29  -0.1496 -0.51978 0.221    -0.33786      0.0388
#> 14              X33  -0.1290 -0.45636 0.198    -0.29550      0.0375
#> 15              X36  -0.1254 -0.48194 0.231    -0.30676      0.0559
#> 16              X41   0.0999 -0.20232 0.402    -0.05381      0.2537

Visualization (Selected Model Coefficients)

# Plot only coefficients from the Lasso-selected model
selected_plot <- result_lasso$intervals %>%
                            filter(model_id == "LassoSelected") %>%
                            pull(coefficient_name)

plot(result_lasso, subset_pars = selected_plot,
     main = "Simultaneous CIs for Lasso-Selected Model", las.labels = 1)
#> Warning: Labels may be too long for the plot width on this device. Consider
#> subsetting, 'label.trim', reducing 'cex.labels'/'las.labels', or resizing plot
#> device.

Interpretation

Post-Selection Validity: The simultaneous confidence intervals (lower, upper) provide statistically valid 95% coverage after selecting the model using Lasso. They account for the uncertainty introduced by the data-driven variable selection step.

Protection Against False Discoveries: For the true predictors (X1-X5), both interval types correctly identify them as significant. However, for several noise predictors incorrectly selected by Lasso (X18, X24), the naive intervals misleadingly suggest statistical significance (they exclude the true value of 0). The simultaneous intervals, being wider due to the \(\widehat{K}_{\alpha}\) adjustment, correctly contain 0 for these noise predictors, thus protecting against false positive conclusions.

Cost of Validity: This protection comes at the cost of wider intervals for all coefficients, reflecting the true uncertainty post-selection.

Vignette

PosiR: Post-Selection Inference via Simultaneous Confidence Intervals

Package Functions

Theoretical Foundation

Installation

Load the package

Example 1: Simulated Data

Define the model universe:

Run simultaneous_ci():

Visualize Intervals

6. Interpretation

Example 2: `mtcars` Dataset

Prepare data

Define models

Run `simultaneous_ci()`

Compare simultaneous vs. naive 95% intervals:

Visualize intervals

Interpretation

Example 3: High-Dimensional Data with Lasso Selection

Simulate Data with Interaction

Perform Lasso Selection (if glmnet is available)

Define model universe

`Run simultaneous_ci()`

Compare Intervals (Focus on Selected Model)

Visualization (Selected Model Coefficients)

Interpretation

Conclusion

Reference

Vignette

PosiR: Post-Selection Inference via Simultaneous Confidence Intervals

Package Functions

Theoretical Foundation

Installation

Load the package

Example 1: Simulated Data

Define the model universe:

Run simultaneous_ci():

Visualize Intervals

6. Interpretation

Example 2: mtcars Dataset

Prepare data

Define models

Run simultaneous_ci()

Compare simultaneous vs. naive 95% intervals:

Visualize intervals

Interpretation

Example 3: High-Dimensional Data with Lasso Selection

Simulate Data with Interaction

Perform Lasso Selection (if glmnet is available)

Define model universe

Run simultaneous_ci()

Compare Intervals (Focus on Selected Model)

Visualization (Selected Model Coefficients)

Interpretation

Conclusion

Reference

Example 2: `mtcars` Dataset

Run `simultaneous_ci()`

`Run simultaneous_ci()`