Statistical inference after model selection is challenging because
standard methods often ignore the selection process, leading to biased
results like confidence intervals with lower-than-nominal coverage. This
issue, known as post-selection inference (PoSI), is addressed by the
PosiR
package, which implements the methodology from Kuchibhotla, Kolassa, and Kuffner (2022).
PosiR
constructs confidence intervals that are
simultaneously valid across a pre-defined universe of models, ensuring
the desired coverage probability (e.g., 95%) regardless of the model
selection process.
This vignette introduces Post-Selection Inference concepts, lists the package functions, explains the methodology, and provides examples.
The PosiR package provides two core functions:
simultaneous_ci()
Key Arguments:
X
: Design matrix (n × p, no intercept).
y
: Response vector (length n).
Q_universe
: Named list of model indices (including
intercept if add_intercept = TRUE
).
B
: Number of bootstrap samples (default:
1000).
alpha
: Significance level (default: 0.05).
verbose
: Logical; if TRUE
, shows
progress messages.
Returns:
intervals
: Data frame with columns
model_id
, coefficient_name
,
estimate
, lower
, upper
,
psi_hat_nqj
, se_nqj
.
K_alpha
: Critical value for simultaneous
coverage.
Other diagnostic outputs (e.g., T_star_b
,
``warnings).
plot.simultaneous_ci_result()
Visualizes results from simultaneous_ci().
Key Arguments:
x
: A simultaneous_ci_result
object.
subset_pars
: Vector of coefficient names to
plot.
las.labels
: Label orientation (e.g., 2 for
perpendicular).
col.ci
: Color of the confidence intervals.
main
: Plot title.
PosiR
constructs confidence intervals for coefficients
\(\theta_{q,j}\) (where \(q\) denotes a model in the universe 𝒬 and
\(j\) denotes a specific coefficient
within that model) that are simultaneously valid.
\[ P(\theta_{q,j} \in \widehat{\text{CI}}_{q,j} \text{ for all } q \in 𝒬, j \text{ in model } q) \ge 1-\alpha \]
The interval for \(\theta_{q,j}\) is: \[ \widehat{\text{CI}}_{q,j} = \left[ \widehat{\theta}_{q,j} \pm \widehat{K}_{\alpha} \frac{\widehat{\Psi}_{n,q,j}^{1/2}}{\sqrt{n}} \right] \] where:
\(\widehat{\theta}_{q,j}\) is the estimate of the coefficient (e.g., from OLS) in model \(q\) using the original data.
\(\widehat{\Psi}_{n,q,j}\) is an estimate of the asymptotic variance of \(\sqrt{n}(\widehat{\theta}_{q,j} - \theta_{q,j})\).
\(\widehat{K}_{\alpha}\) is a critical value (quantile) obtained from a bootstrap procedure, designed to ensure simultaneous coverage.
The bootstrap procedure (Algorithm 1 from Kuchibhotla, Kolassa, and Kuffner (2022)) is:
Original Estimate: Compute \(\widehat{\theta}_{q,j}\) for all relevant \(j\) in all models \(q \in 𝒬\) using the original data.
Bootstrap Sampling: For \(b = 1, \ldots, B\), resample pairs \((X_i, y_i)\) with replacement from the original data to compute the bootstrap estimates \(\widehat{\theta}_{q,j}^{*b}\) for all \(j, q\).
Variance Estimation:
\[ \widehat{\Psi}_{n,q,j} := \frac{1}{B_{valid}-1} \sum_{b \in \text{valid}} \left[ \sqrt{n} \left( \widehat{\theta}_{q,j}^{*b} - \widehat{\theta}_{q,j} \right) \right]^2 \]
Bootstrap Max-t Statistics (\(T^{*b}\)) \[ T^{*b} := \max_{q \in 𝒬} \max_{j \in \text{coeffs}(q)} \left| n^{1/2} \widehat{\Psi}_{n,q,j}^{-1/2} \left( \widehat{\theta}_{q,j}^{*b} - \widehat{\theta}_{q,j} \right) \right| \]
Critical Value: Compute \(\widehat{K}_{\alpha}\) as the \((1-\alpha)\) quantile of the collected bootstrap max-t statistics \(\{ T^{*1}, \ldots, T^{*B} \}\).
Confidence Intervals: \[ \widehat{\theta}_{q,j} \pm \widehat{K}_{\alpha} \sqrt{\widehat{\Psi}_{n,q,j}/n} \]
By using \(\widehat{K}_{\alpha}\) derived from the maximum statistic across the model universe, the resulting intervals achieve the desired simultaneous coverage probability: \[ P(\text{all } \theta_{q,j} \in \widehat{\text{CI}}_{q,j}) \ge 1-\alpha \]
First, ensure devtools
is installed:
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
devtools::install_github("Chukyhenry/PosiR")
Optional dependencies:
pbapply
: For progress bars during computation.
glmnet
: For the Lasso example (Example 3).
if (!requireNamespace("dplyr", quietly = TRUE)) {
stop("Please install the 'dplyr' package to run this vignette.")
}
library(PosiR)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(stats)
We simulate data where Age affects the response \(\beta_{\text{Age}} = 0.5\).
set.seed(123) # for reproducibility
X <- matrix(rnorm(200 * 2), ncol = 2, dimnames = list(NULL, c("Age", "Income")))
y <- 0.5 * X[, "Age"] + rnorm(200) # the true intercept is 0, the true coefficient for Age is 0.5, and the true coefficient for Income is 0.
We specify a universe \(𝒬\)
containing two potential models: one with only Age and one with both Age
and Income. The indices refer to the columns in X. Note that
simultaneous_ci
adds an intercept by default
(add_intercept = TRUE
), which will also be included in the
inference.
We call the function, providing the data X, y, and the model universe
Q
. We use B=1000
bootstrap samples for
reasonably stable results, cores=2
to speed up computation
via parallel processing (adjust based on your machine), and seed for
reproducibility. verbose=FALSE
suppresses progress
messages.
The output result
is a list. Key components include:
intervals
: A data frame with estimates and
simultaneous CIs.
K_alpha
: The calculated critical value \(\widehat{K}_{\alpha}\).
# Show the calculated critical value
print(paste("Calculated K_alpha:", round(result$K_alpha, 3)))
#> [1] "Calculated K_alpha: 2.347"
# Display the intervals data frame
print(result$intervals)
#> model_id coefficient_name estimate lower upper psi_hat_nqj
#> 1 Age_Income (Intercept) 0.03151043 -0.1307487 0.1937696 0.9561812
#> 2 Age_Income Age 0.46907362 0.2712020 0.6669452 1.4219643
#> 3 Age_only (Intercept) 0.02749026 -0.1483899 0.2033704 1.1234542
#> se_nqj
#> 1 0.06914410
#> 2 0.08431976
#> 3 0.07494846
The columns of the resulting intervals
data frame
are:
model_id
: Name of the model from
Q
.
coefficient_name
: Name of the coefficient (including
intercept).
estimate
: Point estimate \(\widehat{\theta}_{q,j}\).
lower
, upper
: Bounds of the \((1-\alpha)\)% simultaneous confidence
interval.
psi_hat_nqj
: Bootstrap variance estimate \(\widehat{\Psi}_{n,q,j}\).
se_nqj
: Standard error estimate \(\sqrt{\widehat{\Psi}_{n,q,j}/n}\).
We use the S3 plot method for simultaneous_ci_result
objects. las.labels=2
rotates labels for readability.
#### Interpretation
Age Coefficient: The point estimates for
Age
in both models are close to the true value of 0.5.
Importantly, the simultaneous 95% confidence intervals (e.g., approx.
[0.26, 0.67] for the Age_Income
model) comfortably contain
the true value 0.5 and exclude 0, correctly indicating a significant
effect.
Income Coefficient: The point estimate for
Income
(only present in the Age_Income
model)
is close to 0. The simultaneous CI (approx. [-0.19, 0.17]) contains the
true value 0, correctly suggesting no significant effect.
Intercept: The point estimates for the
(Intercept)
are close to 0. The simultaneous CIs contain
the true value 0, indicating no significant intercept term.
Simultaneous Validity: These intervals are valid simultaneously. We can be 95% confident that all these reported intervals contain their respective true parameter values, regardless of whether we decided to focus on the “Age_only” or “Age_Income” model after looking at the data.
mtcars
DatasetNow, let’s apply the method to a real dataset, mtcars
,
to predict miles per gallon (mpg).
We select hp (horsepower), wt (weight), and qsec (1/4 mile time) as potential predictors for mpg.
We consider models involving each predictor individually and a full
model with all three. The indices refer to columns in
X_mtcars
. The intercept is added automatically by
simultaneous_ci
.
simultaneous_ci()
We calculate the standard “naive” 95% confidence intervals using the normal approximation \(z_{\alpha/2}\) quantile and the same standard error estimate \(\sqrt{\widehat{\Psi}_{n,q,j}/n}\) derived from the bootstrap. The naive interval is: \[ \widehat{\text{CI}}_{q,j}^{\text{unadj}} = \left[ \widehat{\theta}_{q,j} \pm z_{\alpha/2} \frac{\widehat{\Psi}_{n,q,j}^{1/2}}{\sqrt{n}} \right] \] where \(z_{\alpha/2} \approx 1.96\) for \(\alpha = 0.05\).
z_alpha_2 <- stats::qnorm(0.975)
intervals_comparison <- result_mtcars$intervals %>%
mutate(
naive_lower = estimate - z_alpha_2 * se_nqj,
naive_upper = estimate + z_alpha_2 * se_nqj
) %>%
select(model_id, coefficient_name, estimate, lower, upper, naive_lower, naive_upper) %>%
arrange(model_id, coefficient_name) # Ensure consistent ordering
print(intervals_comparison, digits = 3)
#> model_id coefficient_name estimate lower upper naive_lower naive_upper
#> 1 Full_Model (Intercept) 37.2273 31.5341 42.9204 33.0148 41.4397
#> 2 Full_Model hp -0.0318 -0.0524 -0.0112 -0.0470 -0.0165
#> 3 Full_Model wt -3.8778 -5.7458 -2.0099 -5.2599 -2.4957
#> 4 HP_only (Intercept) 20.0906 17.2695 22.9118 18.0032 22.1780
#> 5 QSEC_only wt 5.2916 3.8172 6.7660 4.2007 6.3826
#> 6 WT_only hp 0.1011 0.0666 0.1356 0.0756 0.1266
Interval Width: As expected, the simultaneous intervals (lower, upper) are uniformly wider than the corresponding naive intervals (naive_lower, naive_upper). This widening is caused by the multiplier \(\widehat{K}_{\alpha}\) being larger than \(z_{\alpha/2} \approx 1.96\).
Significance: For these mtcars models, the conclusions about which coefficients are statistically significant (i.e., interval excludes zero) do not change between the naive and simultaneous intervals. For example, hp and wt generally show significant negative associations with mpg, while qsec and the intercept terms show less significance.
PoSI Correction: The key takeaway is that the simultaneous intervals provide valid inference after considering multiple models. If we had used the naive intervals and selected, for instance, the Full_Model because it looked “best” based on some criterion, the naive intervals for that model’s coefficients would not strictly be 95% confidence intervals due to the selection bias. The simultaneous intervals, however, maintain their 95% coverage guarantee across the entire set Q_mtcars. This correction ensures our conclusions are not overly optimistic.
A common workflow involves using a variable selection method like
Lasso to choose a smaller model from many potential predictors, followed
by inference on the selected model. PosiR
can provide valid
post-selection inference in this context by defining a universe \(𝒬\) that includes at least the
Lasso-selected model.
We simulate data with \(100\) observations and \(50\) predictors, where only the first 5 predictors (\(X_1, \ldots, X_5\)) have non-zero coefficients.
set.seed(123)
n_lasso <- 100
p_lasso <- 50
X_lasso <- matrix(rnorm(n_lasso * p_lasso), n_lasso, p_lasso,
dimnames = list(NULL, paste0("X", 1:p_lasso)))
# True coefficients: 1 for first 5, 0 for rest
beta_true <- c(rep(1.0, 5), rep(0, p_lasso - 5))
true_intercept <- 0.5
# Generate response (linear model)
y_lasso <- drop(true_intercept + X_lasso %*% beta_true + rnorm(n_lasso))
We use glmnet
to perform Lasso regression and identify
variables selected using cross-validation (lambda.min).
# This chunk only runs if glmnet is installed
if (requireNamespace("glmnet", quietly = TRUE)) {
library(glmnet)
# Code using glmnet
} else {
message("glmnet is not available. Skipping this example.")
}
#> Loading required package: Matrix
#> Loaded glmnet 4.1-8
library(glmnet)
# Use cv.glmnet to find optimal lambda
cv_lasso_fit <- cv.glmnet(X_lasso, y_lasso, alpha = 1) # alpha=1 for Lasso
# Get coefficients at lambda.min, excluding the intercept
lasso_coeffs <- coef(cv_lasso_fit, s = "lambda.min")[-1, 1]
# Identify indices of selected variables (non-zero coefficients)
# Indices are relative to the columns of X_lasso (1 to p_lasso)
select_var_index <- which(lasso_coeffs != 0)
# Get names of selected variables
select_var_names <- names(select_var_index)
cat("Lasso selected variables (indices in X_lasso):", paste(select_var_index, collapse = ", "), "\n")
#> Lasso selected variables (indices in X_lasso): 1, 2, 3, 4, 5, 9, 13, 18, 22, 24, 26, 29, 33, 36, 41
cat("Lasso selected variables (names):", paste(select_var_names, collapse = ", "), "\n")
#> Lasso selected variables (names): X1, X2, X3, X4, X5, X9, X13, X18, X22, X24, X26, X29, X33, X36, X41
# If no variables selected (unlikely here), handle gracefully for Q definition
if (length(select_var_index) == 0) {
warning("Lasso selected no variables. Defining Q with intercept-only and full model.")
select_var_index1 <- 1 # Just intercept index for design matrix
} else {
# IMPORTANT: Need indices relative to the design matrix used in simultaneous_ci
# which includes intercept as column 1. Add 1 to Lasso indices.
select_var_index1 <- c(1, select_var_index + 1)
}
# Fallback if glmnet is not installed
if (!requireNamespace("glmnet", quietly = TRUE)) {
warning("glmnet package not found. Using arbitrary selection for vignette.", call. = FALSE)
select_var_index <- c(1, 2, 3, 4, 5, 9, 13, 18, 22, 24, 26, 29, 33, 36, 41)
select_var_names <- paste0("X", select_var_index)
select_var_index1 <- c(1, select_var_index + 1)
}
Our universe \(𝒬\) for post-selection inference must include the model selected by Lasso. We can also include other models, like the full model, for comparison or robustness.
# Indices for X in simultaneous_ci context (assuming add_intercept=TRUE)
# Intercept is 1, X1 is 2, ..., Xp is p+1
full_model_indices <- 1:(p_lasso + 1)
Q_lasso <- list(
# Model selected by Lasso (Intercept + selected X's)
LassoSelected = select_var_index1,
# Full model (Intercept + All X's) - optional, but common for context
FullModel = full_model_indices
)
# Check if Lasso selection was empty
if (length(select_var_index1) == 1 && select_var_index1 == 1) {
Q_lasso$LassoSelected <- 1 # Ensure it's just the intercept index
}
print("Models in Q_lasso (indices refer to design matrix with intercept):")
#> [1] "Models in Q_lasso (indices refer to design matrix with intercept):"
# Print only first few indices for brevity if models are large
print(lapply(Q_lasso, function(idx) if(length(idx)>10) c(idx[1:5], "...", idx[length(idx)]) else idx))
#> $LassoSelected
#> X1 X2 X3 X4 X41
#> "1" "2" "3" "4" "5" "..." "42"
#>
#> $FullModel
#> [1] "1" "2" "3" "4" "5" "..." "51"
Run simultaneous_ci()
We now run PosiR
using the original data (X_lasso,
y_lasso) and the universe Q_lasso containing the data-dependent Lasso
model.
intervals_lasso_comp <- result_lasso$intervals %>%
filter(model_id == "LassoSelected") %>% # Focus on the Lasso model
mutate(
naive_lower = estimate - qnorm(0.975) * se_nqj,
naive_upper = estimate + qnorm(0.975) * se_nqj
) %>%
select(coefficient_name, estimate, lower, upper, naive_lower, naive_upper) %>%
arrange(match(coefficient_name, c("(Intercept)", paste0("X", 1:p_lasso)))) # Order numerically
print(intervals_lasso_comp, digits = 3)
#> coefficient_name estimate lower upper naive_lower naive_upper
#> 1 (Intercept) 0.5416 0.20017 0.883 0.36796 0.7153
#> 2 X1 0.9414 0.57313 1.310 0.75409 1.1287
#> 3 X2 1.0042 0.67599 1.332 0.83728 1.1712
#> 4 X3 1.1591 0.80776 1.510 0.98038 1.3378
#> 5 X4 0.9961 0.67931 1.313 0.83498 1.1573
#> 6 X5 0.9351 0.53710 1.333 0.73264 1.1375
#> 7 X9 -0.1018 -0.39837 0.195 -0.25264 0.0491
#> 8 X13 0.1298 -0.22662 0.486 -0.05150 0.3111
#> 9 X18 0.3062 0.00275 0.610 0.15186 0.4606
#> 10 X22 -0.0728 -0.46496 0.319 -0.27228 0.1266
#> 11 X24 0.2759 -0.02264 0.574 0.12406 0.4278
#> 12 X26 0.1806 -0.19081 0.552 -0.00831 0.3695
#> 13 X29 -0.1496 -0.51978 0.221 -0.33786 0.0388
#> 14 X33 -0.1290 -0.45636 0.198 -0.29550 0.0375
#> 15 X36 -0.1254 -0.48194 0.231 -0.30676 0.0559
#> 16 X41 0.0999 -0.20232 0.402 -0.05381 0.2537
# Plot only coefficients from the Lasso-selected model
selected_plot <- result_lasso$intervals %>%
filter(model_id == "LassoSelected") %>%
pull(coefficient_name)
plot(result_lasso, subset_pars = selected_plot,
main = "Simultaneous CIs for Lasso-Selected Model", las.labels = 1)
#> Warning: Labels may be too long for the plot width on this device. Consider
#> subsetting, 'label.trim', reducing 'cex.labels'/'las.labels', or resizing plot
#> device.
Post-Selection Validity: The simultaneous confidence intervals (lower, upper) provide statistically valid 95% coverage after selecting the model using Lasso. They account for the uncertainty introduced by the data-driven variable selection step.
Protection Against False Discoveries: For the true predictors (X1-X5), both interval types correctly identify them as significant. However, for several noise predictors incorrectly selected by Lasso (X18, X24), the naive intervals misleadingly suggest statistical significance (they exclude the true value of 0). The simultaneous intervals, being wider due to the \(\widehat{K}_{\alpha}\) adjustment, correctly contain 0 for these noise predictors, thus protecting against false positive conclusions.
Cost of Validity: This protection comes at the cost of wider intervals for all coefficients, reflecting the true uncertainty post-selection.
The PosiR
package provides a practical implementation of
the simultaneous inference framework from Kuchibhotla, Kolassa, and Kuffner (2022). By
using the simultaneous_ci()
function, researchers can
obtain valid confidence intervals for linear model coefficients even
when model selection has been performed, ensuring robust and reliable
statistical conclusions. Remember that the validity is with respect to
the specified universe of models \(𝒬\).