yaap fits archetypal analysis (AA) models. In a
tidymodels workflow, it is best understood as an unsupervised
decomposition or feature-extraction tool, closer in role to
recipes::step_pca() than to a supervised
parsnip model specification. yaap learns
archetype coordinates and sample composition weights, then exposes those
results through interfaces that work naturally with the tidymodels
ecosystem:
step_archetypes() is a recipes step that
learns archetypes during prep() and returns archetype
composition weights during bake().This vignette uses only lightweight pieces of the ecosystem. Because
AA is an unsupervised representation method, the examples focus on
preprocessing, feature extraction, and model inspection rather than on
parsnip fitting.
The examples use the built-in USArrests data, the same
data used in the recipes::step_pca() examples. The four
numeric variables are on different scales, so the recipes below
normalize them before fitting either PCA or AA. State regions are used
only for visualization.
arrests <- USArrests
state <- rownames(arrests)
region <- state.region[match(state, state.name)]
analysis_K <- 3
region_cols <- c(
"Northeast" = "#0072B2",
"South" = "#D55E00",
"North Central" = "#009E73",
"West" = "#CC79A7"
)
head(data.frame(state = state, region = region, arrests, row.names = NULL))
#> state region Murder Assault UrbanPop Rape
#> 1 Alabama South 13.2 236 58 21.2
#> 2 Alaska West 10.0 263 48 44.5
#> 3 Arizona West 8.1 294 80 31.0
#> 4 Arkansas South 8.8 190 50 19.5
#> 5 California West 9.0 276 91 40.6
#> 6 Colorado West 7.9 204 78 38.7The examples below use K = 3 archetypes so that the
resulting composition weights can be inspected on a two-dimensional
simplex.
step_archetypes() turns AA into a preprocessing step.
During prep(), the step fits archetypes on the selected
numeric predictors. During bake(), it projects data onto
the learned archetype simplex and returns one composition column per
archetype.
The recipes argument is called num_comp, following
tidymodels naming conventions, but it is the same quantity as
K.
rec_base <- recipe(~ ., data = arrests) |>
step_normalize(all_numeric())
rec <- rec_base |>
step_archetypes(
all_numeric(),
num_comp = analysis_K,
seed = 42,
options = list(
init = "furthest_sum",
max_iter = 100,
tol_r2 = 0
)
)
rec_prep <- prep(rec, training = arrests)
rec_prep
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> predictor: 4
#>
#> ── Training information
#> Training data contained 50 data points and no incomplete rows.
#>
#> ── Operations
#> • Centering and scaling for: Murder, Assault, UrbanPop, Rape | Trained
#> • Archetypal Analysis Projection (K=3, delta=0): Murder, ... | Trainedbaked <- bake(rec_prep, new_data = arrests)
head(data.frame(state = state, region = region, baked, row.names = NULL))
#> state region A1 A2 A3
#> 1 Alabama South 0.35711557 9.999945e-09 6.428844e-01
#> 2 Alaska West 0.27961708 3.303461e-01 3.900368e-01
#> 3 Arizona West 0.15294787 5.772798e-01 2.697723e-01
#> 4 Arkansas South 0.55200351 9.999964e-09 4.479965e-01
#> 5 California West 0.00000001 1.000000e+00 1.000000e-08
#> 6 Colorado West 0.19388960 8.061104e-01 9.999962e-09
aa_cols <- paste0("A", seq_len(analysis_K))
baked[, aa_cols] |> as.matrix() |> rowSums() |> range()
#> [1] 1 1By default, the original numeric predictors are removed and replaced
by composition columns named A1, A2, and so
on. Set keep_original_cols = TRUE when both the normalized
predictors and archetype weights should be retained.
rec_keep <- rec_base |>
step_archetypes(
all_numeric(),
num_comp = analysis_K,
keep_original_cols = TRUE,
seed = 42,
options = list(
max_iter = 100,
tol_r2 = 0
)
) |>
prep(training = arrests)
arrests_with_originals <- data.frame(
state = state,
region = region,
bake(rec_keep, new_data = arrests),
row.names = NULL
)
head(arrests_with_originals)
#> state region Murder Assault UrbanPop Rape A1
#> 1 Alabama South 1.24256408 0.7828393 -0.5209066 -0.003416473 0.35711557
#> 2 Alaska West 0.50786248 1.1068225 -1.2117642 2.484202941 0.27961708
#> 3 Arizona West 0.07163341 1.4788032 0.9989801 1.042878388 0.15294787
#> 4 Arkansas South 0.23234938 0.2308680 -1.0735927 -0.184916602 0.55200351
#> 5 California West 0.27826823 1.2628144 1.7589234 2.067820292 0.00000001
#> 6 Colorado West 0.02571456 0.3988593 0.8608085 1.864967207 0.19388960
#> A2 A3
#> 1 9.999945e-09 6.428844e-01
#> 2 3.303461e-01 3.900368e-01
#> 3 5.772798e-01 2.697723e-01
#> 4 9.999964e-09 4.479965e-01
#> 5 1.000000e+00 1.000000e-08
#> 6 8.061104e-01 9.999962e-09step_archetypes() with
step_pca()The closest tidymodels analogy for yaap is
recipes::step_pca(): both steps learn an unsupervised
representation of the numeric predictors during prep(). The
outputs have different meanings. PCA returns orthogonal score columns
such as PC1, PC2, and PC3; AA
returns composition weights such as A1, A2,
and A3 that are non-negative and sum to one for each
state.
The two recipes below are trained in parallel on the same normalized arrest variables.
rec_aa <- rec_base |>
step_archetypes(
all_numeric(),
num_comp = analysis_K,
seed = 42,
options = list(
init = "furthest_sum",
max_iter = 100,
tol_r2 = 0
)
) |>
prep(training = arrests)
rec_pca <- rec_base |>
step_pca(all_numeric(), num_comp = analysis_K) |>
prep(training = arrests)
aa_scores <- bake(rec_aa, new_data = arrests)
pca_scores <- bake(rec_pca, new_data = arrests)
score_data <- data.frame(
state = state,
region = region,
pca_scores,
aa_scores,
row.names = NULL
)
head(score_data)
#> state region PC1 PC2 PC3 A1 A2
#> 1 Alabama South -0.9756604 -1.1220012 0.43980366 0.35711557 9.999945e-09
#> 2 Alaska West -1.9305379 -1.0624269 -2.01950027 0.27961708 3.303461e-01
#> 3 Arizona West -1.7454429 0.7384595 -0.05423025 0.15294787 5.772798e-01
#> 4 Arkansas South 0.1399989 -1.1085423 -0.11342217 0.55200351 9.999964e-09
#> 5 California West -2.4986128 1.5274267 -0.59254100 0.00000001 1.000000e+00
#> 6 Colorado West -1.4993407 0.9776297 -1.08400162 0.19388960 8.061104e-01
#> A3
#> 1 6.428844e-01
#> 2 3.900368e-01
#> 3 2.697723e-01
#> 4 4.479965e-01
#> 5 1.000000e-08
#> 6 9.999962e-09The PCA step returns orthogonal score columns. The AA step returns convex composition weights over three estimated archetypal states. Inspecting both trained steps highlights the difference: PCA loadings describe directions of variation, while AA coordinates describe extreme profiles in the normalized feature space.
tidy(rec_pca, number = 2, type = "variance")
#> # A tibble: 16 × 4
#> terms value component id
#> <chr> <dbl> <int> <chr>
#> 1 variance 2.48 1 pca_cOyAJ
#> 2 variance 0.990 2 pca_cOyAJ
#> 3 variance 0.357 3 pca_cOyAJ
#> 4 variance 0.173 4 pca_cOyAJ
#> 5 cumulative variance 2.48 1 pca_cOyAJ
#> 6 cumulative variance 3.47 2 pca_cOyAJ
#> 7 cumulative variance 3.83 3 pca_cOyAJ
#> 8 cumulative variance 4 4 pca_cOyAJ
#> 9 percent variance 62.0 1 pca_cOyAJ
#> 10 percent variance 24.7 2 pca_cOyAJ
#> 11 percent variance 8.91 3 pca_cOyAJ
#> 12 percent variance 4.34 4 pca_cOyAJ
#> 13 cumulative percent variance 62.0 1 pca_cOyAJ
#> 14 cumulative percent variance 86.8 2 pca_cOyAJ
#> 15 cumulative percent variance 95.7 3 pca_cOyAJ
#> 16 cumulative percent variance 100 4 pca_cOyAJ
tidy(rec_aa, number = 2)
#> # A tibble: 12 × 4
#> archetype term value id
#> <chr> <chr> <dbl> <chr>
#> 1 A1 Murder -1.28 archetypes_TtzXU
#> 2 A2 Murder 0.278 archetypes_TtzXU
#> 3 A3 Murder 1.75 archetypes_TtzXU
#> 4 A1 Assault -1.47 archetypes_TtzXU
#> 5 A2 Assault 1.26 archetypes_TtzXU
#> 6 A3 Assault 1.97 archetypes_TtzXU
#> 7 A1 UrbanPop -2.32 archetypes_TtzXU
#> 8 A2 UrbanPop 1.76 archetypes_TtzXU
#> 9 A3 UrbanPop 0.999 archetypes_TtzXU
#> 10 A1 Rape -1.07 archetypes_TtzXU
#> 11 A2 Rape 2.07 archetypes_TtzXU
#> 12 A3 Rape 1.14 archetypes_TtzXUNext, the scores from both steps are plotted. The PCA scores are plotted in the usual Cartesian coordinate system. The AA composition weights are plotted on a ternary simplex, where each vertex corresponds to one archetype. The same states are labeled in both plots: states selected either because they are extreme on one of the first two principal-component axes or because they are closest to an archetype vertex.
pca_rng <- extendrange(c(score_plot_data$PC1, score_plot_data$PC2))
ggplot(score_plot_data, aes(PC1, PC2, colour = region)) +
geom_hline(yintercept = 0, colour = "grey85") +
geom_vline(xintercept = 0, colour = "grey85") +
geom_point(size = 1.8) +
geom_text(
aes(label = label),
na.rm = TRUE,
nudge_y = diff(pca_rng) * 0.035,
size = 3,
show.legend = FALSE
) +
coord_equal(xlim = pca_rng, ylim = pca_rng) +
scale_colour_manual(values = region_cols, drop = FALSE) +
labs(title = "step_pca()", x = "PC1", y = "PC2", colour = "Region") +
theme_minimal(base_size = 10) +
theme(panel.grid.minor = element_blank())PCA scores for USArrests. Labels mark states selected as PCA extremes or AA simplex-vertex representatives.
ggtern(score_plot_data, aes(A1, A2, A3, colour = region)) +
geom_point(size = 1.8) +
geom_text(
aes(label = label),
na.rm = TRUE,
position = "identity",
size = 3,
show.legend = FALSE
) +
scale_colour_manual(values = region_cols, drop = FALSE) +
labs(
title = "step_archetypes()",
x = "A1",
y = "A2",
z = "A3",
colour = "Region"
) +
theme_minimal(base_size = 10) +
theme_showarrows() +
theme(
legend.position = "bottom",
plot.margin = margin(8, 16, 8, 16),
tern.panel.mask.show = FALSE
)AA composition weights for USArrests on the archetype simplex. Labels match the PCA plot for comparison.
AA preprocessing returns scores that are barycentric coordinates:
they are non-negative and sum to one for each observation. Tree models
and other flexible learners can often use these scores directly. Linear,
distance-based, or Gaussian models may be sensitive to the simplex
constraint and the induced collinearity of the raw weights. In those
settings, a typical transformation is the isometric log-ratio
(ilr), which maps compositions from the simplex to ordinary
Euclidean space. Log-ratio transforms require strictly positive
compositions, so apply an appropriate zero-handling strategy first if
any composition weights are zero. See package compositions
for details on working with compositional data, including
ilr and other transformations that can be used after
bake() when a downstream model expects ordinary Euclidean
predictors.
step_archetypes() declares num_comp and
delta as tunable recipe parameters. This lets the step
participate in a larger tidymodels tuning workflow when the rest of the
modeling pipeline supplies resampling, metrics, and an outcome.
rec_tune <- rec_base |>
step_archetypes(
all_numeric(),
num_comp = tune(),
delta = tune()
)
tunable(rec_tune$steps[[2L]])
#> # A tibble: 2 × 5
#> name call_info source component component_id
#> <chr> <list> <chr> <chr> <chr>
#> 1 num_comp <named list [3]> recipe step_archetypes archetypes_SetHB
#> 2 delta <named list [3]> recipe step_archetypes archetypes_SetHBFor supervised modeling pipelines, use step_archetypes()
inside the broader tidymodels workflow and tune num_comp
alongside the downstream model parameters.