Tidymodels Workflows with yaap

Overview

yaap fits archetypal analysis (AA) models. In a tidymodels workflow, it is best understood as an unsupervised decomposition or feature-extraction tool, closer in role to recipes::step_pca() than to a supervised parsnip model specification. yaap learns archetype coordinates and sample composition weights, then exposes those results through interfaces that work naturally with the tidymodels ecosystem:

step_archetypes() is a recipes step that learns archetypes during prep() and returns archetype composition weights during bake().

This vignette uses only lightweight pieces of the ecosystem. Because AA is an unsupervised representation method, the examples focus on preprocessing, feature extraction, and model inspection rather than on parsnip fitting.

library(yaap)
library(generics)
library(ggplot2)
library(ggtern)
library(recipes)
library(tune)

The examples use the built-in USArrests data, the same data used in the recipes::step_pca() examples. The four numeric variables are on different scales, so the recipes below normalize them before fitting either PCA or AA. State regions are used only for visualization.

arrests <- USArrests
state <- rownames(arrests)
region <- state.region[match(state, state.name)]
analysis_K <- 3
region_cols <- c(
    "Northeast" = "#0072B2",
    "South" = "#D55E00",
    "North Central" = "#009E73",
    "West" = "#CC79A7"
)

head(data.frame(state = state, region = region, arrests, row.names = NULL))
#>        state region Murder Assault UrbanPop Rape
#> 1    Alabama  South   13.2     236       58 21.2
#> 2     Alaska   West   10.0     263       48 44.5
#> 3    Arizona   West    8.1     294       80 31.0
#> 4   Arkansas  South    8.8     190       50 19.5
#> 5 California   West    9.0     276       91 40.6
#> 6   Colorado   West    7.9     204       78 38.7

The examples below use K = 3 archetypes so that the resulting composition weights can be inspected on a two-dimensional simplex.

Using Archetypes in Recipes

step_archetypes() turns AA into a preprocessing step. During prep(), the step fits archetypes on the selected numeric predictors. During bake(), it projects data onto the learned archetype simplex and returns one composition column per archetype.

The recipes argument is called num_comp, following tidymodels naming conventions, but it is the same quantity as K.

rec_base <- recipe(~ ., data = arrests) |>
    step_normalize(all_numeric())

rec <- rec_base |>
    step_archetypes(
        all_numeric(),
        num_comp = analysis_K,
        seed = 42,
        options = list(
            init = "furthest_sum",
            max_iter = 100,
            tol_r2 = 0
        )
    )

rec_prep <- prep(rec, training = arrests)
rec_prep
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> predictor: 4
#> 
#> ── Training information
#> Training data contained 50 data points and no incomplete rows.
#> 
#> ── Operations
#> • Centering and scaling for: Murder, Assault, UrbanPop, Rape | Trained
#> • Archetypal Analysis Projection (K=3, delta=0): Murder, ... | Trained

baked <- bake(rec_prep, new_data = arrests)
head(data.frame(state = state, region = region, baked, row.names = NULL))
#>        state region         A1           A2           A3
#> 1    Alabama  South 0.35711557 9.999945e-09 6.428844e-01
#> 2     Alaska   West 0.27961708 3.303461e-01 3.900368e-01
#> 3    Arizona   West 0.15294787 5.772798e-01 2.697723e-01
#> 4   Arkansas  South 0.55200351 9.999964e-09 4.479965e-01
#> 5 California   West 0.00000001 1.000000e+00 1.000000e-08
#> 6   Colorado   West 0.19388960 8.061104e-01 9.999962e-09

aa_cols <- paste0("A", seq_len(analysis_K))
baked[, aa_cols] |> as.matrix() |> rowSums() |> range()
#> [1] 1 1

By default, the original numeric predictors are removed and replaced by composition columns named A1, A2, and so on. Set keep_original_cols = TRUE when both the normalized predictors and archetype weights should be retained.

rec_keep <- rec_base |>
    step_archetypes(
        all_numeric(),
        num_comp = analysis_K,
        keep_original_cols = TRUE,
        seed = 42,
        options = list(
            max_iter = 100,
            tol_r2 = 0
        )
    ) |>
    prep(training = arrests)

arrests_with_originals <- data.frame(
    state = state,
    region = region,
    bake(rec_keep, new_data = arrests),
    row.names = NULL
)

head(arrests_with_originals)
#>        state region     Murder   Assault   UrbanPop         Rape         A1
#> 1    Alabama  South 1.24256408 0.7828393 -0.5209066 -0.003416473 0.35711557
#> 2     Alaska   West 0.50786248 1.1068225 -1.2117642  2.484202941 0.27961708
#> 3    Arizona   West 0.07163341 1.4788032  0.9989801  1.042878388 0.15294787
#> 4   Arkansas  South 0.23234938 0.2308680 -1.0735927 -0.184916602 0.55200351
#> 5 California   West 0.27826823 1.2628144  1.7589234  2.067820292 0.00000001
#> 6   Colorado   West 0.02571456 0.3988593  0.8608085  1.864967207 0.19388960
#>             A2           A3
#> 1 9.999945e-09 6.428844e-01
#> 2 3.303461e-01 3.900368e-01
#> 3 5.772798e-01 2.697723e-01
#> 4 9.999964e-09 4.479965e-01
#> 5 1.000000e+00 1.000000e-08
#> 6 8.061104e-01 9.999962e-09

Comparing `step_archetypes()` with `step_pca()`

The closest tidymodels analogy for yaap is recipes::step_pca(): both steps learn an unsupervised representation of the numeric predictors during prep(). The outputs have different meanings. PCA returns orthogonal score columns such as PC1, PC2, and PC3; AA returns composition weights such as A1, A2, and A3 that are non-negative and sum to one for each state.

The two recipes below are trained in parallel on the same normalized arrest variables.

rec_aa <- rec_base |>
    step_archetypes(
        all_numeric(),
        num_comp = analysis_K,
        seed = 42,
        options = list(
            init = "furthest_sum",
            max_iter = 100,
            tol_r2 = 0
        )
    ) |>
    prep(training = arrests)

rec_pca <- rec_base |>
    step_pca(all_numeric(), num_comp = analysis_K) |>
    prep(training = arrests)

aa_scores <- bake(rec_aa, new_data = arrests)
pca_scores <- bake(rec_pca, new_data = arrests)

score_data <- data.frame(
    state = state,
    region = region,
    pca_scores,
    aa_scores,
    row.names = NULL
)
head(score_data)
#>        state region        PC1        PC2         PC3         A1           A2
#> 1    Alabama  South -0.9756604 -1.1220012  0.43980366 0.35711557 9.999945e-09
#> 2     Alaska   West -1.9305379 -1.0624269 -2.01950027 0.27961708 3.303461e-01
#> 3    Arizona   West -1.7454429  0.7384595 -0.05423025 0.15294787 5.772798e-01
#> 4   Arkansas  South  0.1399989 -1.1085423 -0.11342217 0.55200351 9.999964e-09
#> 5 California   West -2.4986128  1.5274267 -0.59254100 0.00000001 1.000000e+00
#> 6   Colorado   West -1.4993407  0.9776297 -1.08400162 0.19388960 8.061104e-01
#>             A3
#> 1 6.428844e-01
#> 2 3.900368e-01
#> 3 2.697723e-01
#> 4 4.479965e-01
#> 5 1.000000e-08
#> 6 9.999962e-09

The PCA step returns orthogonal score columns. The AA step returns convex composition weights over three estimated archetypal states. Inspecting both trained steps highlights the difference: PCA loadings describe directions of variation, while AA coordinates describe extreme profiles in the normalized feature space.

tidy(rec_pca, number = 2, type = "variance")
#> # A tibble: 16 × 4
#>    terms                         value component id       
#>    <chr>                         <dbl>     <int> <chr>    
#>  1 variance                      2.48          1 pca_cOyAJ
#>  2 variance                      0.990         2 pca_cOyAJ
#>  3 variance                      0.357         3 pca_cOyAJ
#>  4 variance                      0.173         4 pca_cOyAJ
#>  5 cumulative variance           2.48          1 pca_cOyAJ
#>  6 cumulative variance           3.47          2 pca_cOyAJ
#>  7 cumulative variance           3.83          3 pca_cOyAJ
#>  8 cumulative variance           4             4 pca_cOyAJ
#>  9 percent variance             62.0           1 pca_cOyAJ
#> 10 percent variance             24.7           2 pca_cOyAJ
#> 11 percent variance              8.91          3 pca_cOyAJ
#> 12 percent variance              4.34          4 pca_cOyAJ
#> 13 cumulative percent variance  62.0           1 pca_cOyAJ
#> 14 cumulative percent variance  86.8           2 pca_cOyAJ
#> 15 cumulative percent variance  95.7           3 pca_cOyAJ
#> 16 cumulative percent variance 100             4 pca_cOyAJ
tidy(rec_aa, number = 2)
#> # A tibble: 12 × 4
#>    archetype term      value id              
#>    <chr>     <chr>     <dbl> <chr>           
#>  1 A1        Murder   -1.28  archetypes_TtzXU
#>  2 A2        Murder    0.278 archetypes_TtzXU
#>  3 A3        Murder    1.75  archetypes_TtzXU
#>  4 A1        Assault  -1.47  archetypes_TtzXU
#>  5 A2        Assault   1.26  archetypes_TtzXU
#>  6 A3        Assault   1.97  archetypes_TtzXU
#>  7 A1        UrbanPop -2.32  archetypes_TtzXU
#>  8 A2        UrbanPop  1.76  archetypes_TtzXU
#>  9 A3        UrbanPop  0.999 archetypes_TtzXU
#> 10 A1        Rape     -1.07  archetypes_TtzXU
#> 11 A2        Rape      2.07  archetypes_TtzXU
#> 12 A3        Rape      1.14  archetypes_TtzXU

Next, the scores from both steps are plotted. The PCA scores are plotted in the usual Cartesian coordinate system. The AA composition weights are plotted on a ternary simplex, where each vertex corresponds to one archetype. The same states are labeled in both plots: states selected either because they are extreme on one of the first two principal-component axes or because they are closest to an archetype vertex.

pca_rng <- extendrange(c(score_plot_data$PC1, score_plot_data$PC2))
ggplot(score_plot_data, aes(PC1, PC2, colour = region)) +
    geom_hline(yintercept = 0, colour = "grey85") +
    geom_vline(xintercept = 0, colour = "grey85") +
    geom_point(size = 1.8) +
    geom_text(
        aes(label = label),
        na.rm = TRUE,
        nudge_y = diff(pca_rng) * 0.035,
        size = 3,
        show.legend = FALSE
    ) +
    coord_equal(xlim = pca_rng, ylim = pca_rng) +
    scale_colour_manual(values = region_cols, drop = FALSE) +
    labs(title = "step_pca()", x = "PC1", y = "PC2", colour = "Region") +
    theme_minimal(base_size = 10) +
    theme(panel.grid.minor = element_blank())

PCA scores for USArrests. Labels mark states selected as PCA extremes or AA simplex-vertex representatives.

ggtern(score_plot_data, aes(A1, A2, A3, colour = region)) +
    geom_point(size = 1.8) +
    geom_text(
        aes(label = label),
        na.rm = TRUE,
        position = "identity",
        size = 3,
        show.legend = FALSE
    ) +
    scale_colour_manual(values = region_cols, drop = FALSE) +
    labs(
        title = "step_archetypes()",
        x = "A1",
        y = "A2",
        z = "A3",
        colour = "Region"
    ) +
    theme_minimal(base_size = 10) +
    theme_showarrows() +
    theme(
        legend.position = "bottom",
        plot.margin = margin(8, 16, 8, 16),
        tern.panel.mask.show = FALSE
    )

AA composition weights for USArrests on the archetype simplex. Labels match the PCA plot for comparison.

Machine Learning Note

AA preprocessing returns scores that are barycentric coordinates: they are non-negative and sum to one for each observation. Tree models and other flexible learners can often use these scores directly. Linear, distance-based, or Gaussian models may be sensitive to the simplex constraint and the induced collinearity of the raw weights. In those settings, a typical transformation is the isometric log-ratio (ilr), which maps compositions from the simplex to ordinary Euclidean space. Log-ratio transforms require strictly positive compositions, so apply an appropriate zero-handling strategy first if any composition weights are zero. See package compositions for details on working with compositional data, including ilr and other transformations that can be used after bake() when a downstream model expects ordinary Euclidean predictors.

Declaring Tunable Parameters

step_archetypes() declares num_comp and delta as tunable recipe parameters. This lets the step participate in a larger tidymodels tuning workflow when the rest of the modeling pipeline supplies resampling, metrics, and an outcome.

rec_tune <- rec_base |>
    step_archetypes(
        all_numeric(),
        num_comp = tune(),
        delta = tune()
    )

tunable(rec_tune$steps[[2L]])
#> # A tibble: 2 × 5
#>   name     call_info        source component       component_id    
#>   <chr>    <list>           <chr>  <chr>           <chr>           
#> 1 num_comp <named list [3]> recipe step_archetypes archetypes_SetHB
#> 2 delta    <named list [3]> recipe step_archetypes archetypes_SetHB

For supervised modeling pipelines, use step_archetypes() inside the broader tidymodels workflow and tune num_comp alongside the downstream model parameters.