Covariates can be used to extend standard bage models in two ways:
This vignette gives a brief description of covariates in bage and then illustrates their use with a case study of births in South Korea.
A model in bage typically has a structure that is something like: \[\begin{align} y_{ast} & \sim \text{Poisson}(\gamma_{ast} w_{ast}) \\ \log \gamma_{ast} & \sim \text{Gamma}(\xi^{-1}, (\mu_{ast} \xi)^{-1}) \\ \log \mu_{ast} & = \beta^{(0)} + \beta_{as}^{\text{age:sex}} + \beta_t^{\text{time}} \tag{2.1} \end{align}\] where
In a model with covariates, (2.1) changes to \[\begin{equation} \log \mu_{ast} = \beta^{(0)} + \beta_{as}^{\text{age:sex}} + \beta_t^{\text{time}} + (\pmb{Z} \pmb{\zeta})_{ast} \end{equation}\]
where
Change (2.2) to \[\begin{equation} \mu_i = \sum_{m=0}^M \beta_{j_i^m}^{(m)} + (\pmb{Z} \pmb{\eta})_i \tag{2.2} \end{equation}\]
where - \(\pmb{Z}\) is an \(I \times P\) matrix of covariates; and - \(\pmb{\eta}\) is a vector of coefficients.
The covariate matrix \(\pmb{Z}\) is derived from the raw covariate data by scaling any numeric variables to have mean 0 and standard deviation 1, and by converting any categorical variables to sets of indicator variables. The conversion to indicator variables follows the rules that R uses for “treatment” contrasts. If the categorical has \(C\) categories, then \(C-1\) indicator variabls are constructed, with the first category being omitted.
Each element of \(\pmb{\eta}\) has prior \[\begin{equation} \eta_p \sim \text{N}(0, 1) \end{equation}\]
To illustrate the use of covariates, we will analyse data on births in South Korea.
Besides bage itself, we use dplyr and vctrs for data manipulation, and ggplot2 for plotting results.
Our data is a subset of the the data frame kor_births, which is part of bage.
The data frame gives numbers of births disaggregated by age of mother, region, and year. It also contains a numeric variable called gdp_pc_2023 that gives GDP per capita (in thousands of US dollars) in 2023, and a categorical variable (with levels "Low", "Medium", and "High") describing population density in 2020.
births
#> # A tibble: 585 × 7
#>    age   region  time births  popn gdp_pc_2023 dens_2020
#>    <chr> <fct>  <int>  <int> <int>       <dbl> <chr>    
#>  1 10-14 Busan   2011      1 89822        25.7 High     
#>  2 10-14 Busan   2012      0 83884        25.7 High     
#>  3 10-14 Busan   2013      0 79061        25.7 High     
#>  4 10-14 Busan   2014      1 74741        25.7 High     
#>  5 10-14 Busan   2015      0 68783        25.7 High     
#>  6 10-14 Busan   2016      1 64905        25.7 High     
#>  7 10-14 Busan   2017      1 64251        25.7 High     
#>  8 10-14 Busan   2018      0 63249        25.7 High     
#>  9 10-14 Busan   2019      0 62154        25.7 High     
#> 10 10-14 Busan   2020      0 63498        25.7 High     
#> # ℹ 575 more rowsThe variables gdp_pc_2023 and dens_2020 are both examples of covariates that contribute extra information to the model, beyond what is contained in the outcome, exposure, and classifying variables.
We use function set_covariates() to instruct mod_pois() to treat these variables as covariates.
mod_gdp_dens <- mod_pois(births ~ (age + region + time)^2,
                         data = births,
                         exposure = popn) |>
  set_covariates(~ gdp_pc_2023 + dens_2020) |>
  fit()
#> Building log-posterior function...
#> Finding maximum...
#> Drawing values for hyper-parameters...
mod_gdp_dens
#> 
#>     ------ Fitted Poisson model ------
#> 
#>    births ~ (age + region + time)^2
#> 
#>                  exposure: popn
#> 
#>         term  prior along n_par n_par_free std_dev
#>  (Intercept) NFix()     -     1          1       -
#>          age   RW()   age     9          9    2.70
#>       region    N()     -    16         16    0.21
#>         time   RW()  time    13         13    0.20
#>   age:region   RW()   age   144        144    0.22
#>     age:time   RW()  time   117        117    1.14
#>  region:time   RW()  time   208        208    0.23
#> 
#>  covariates: ~gdp_pc_2023 + dens_2020
#> 
#>  disp: mean = 1
#> 
#>  n_draw var_time var_age optimizer
#>    1000     time     age    nlminb
#> 
#>  time_total time_max time_draw iter converged                    message
#>        0.41     0.24      0.15   30      TRUE   relative convergence (4)To obtain estimates of the coefficients (ie estimates of the \(\zeta_p\)) we call function components() and filter out rows for the "covariates" term.
In East Asia, years with the Dragon zodiac sign sometimes have larger-than-usual numbers of births. To allow for this possibility, we create a dragon-year covariate, and incorporate it into a new model.
births <- births |>
  mutate(is_dragon_year = time == 2012)
mod_dragon <- mod_pois(births ~ (age + region + time)^2,
                      data = births,
                      exposure = popn) |>
  set_covariates(~ is_dragon_year) |>
  fit()
#> Building log-posterior function...
#> Finding maximum...
#> Drawing values for hyper-parameters...
mod_dragon |>
  components() |>
  filter(term == "covariates")
#> # A tibble: 1 × 4
#>   term       component level                           .fitted
#>   <chr>      <chr>     <chr>                      <rdbl<1000>>
#> 1 covariates coef      is_dragon_yearTRUE 0.065 (-0.047, 0.18)There is some evidence for extra births, though there is substantial uncertainty about the size of the effect.
Next we expand the model to allow the dragon-year effect to differ across age groups. We create a variable that takes the values "baseline" in all years, except in dragon years, when it takes the name of the age group. We turn this variable into a factor with "baseline" as its first level.
births <- births |>
  mutate(is_dragon_year_age = if_else(time == 2012, age, "baseline"),
         is_dragon_year_age = factor(is_dragon_year_age, 
                                     levels = c("baseline", unique(age))))
births |>
  filter(time %in% 2011:2013)
#> # A tibble: 135 × 9
#>    age   region   time births  popn gdp_pc_2023 dens_2020 is_dragon_year
#>    <chr> <fct>   <int>  <int> <int>       <dbl> <chr>     <lgl>         
#>  1 10-14 Busan    2011      1 89822        25.7 High      FALSE         
#>  2 10-14 Busan    2012      0 83884        25.7 High      TRUE          
#>  3 10-14 Busan    2013      0 79061        25.7 High      FALSE         
#>  4 10-14 Daegu    2011      3 75776        22.3 Medium    FALSE         
#>  5 10-14 Daegu    2012      2 70399        22.3 Medium    TRUE          
#>  6 10-14 Daegu    2013      0 66858        22.3 Medium    FALSE         
#>  7 10-14 Gwangju  2011      1 53033        26.0 Medium    FALSE         
#>  8 10-14 Gwangju  2012      3 49716        26.0 Medium    TRUE          
#>  9 10-14 Gwangju  2013      0 46918        26.0 Medium    FALSE         
#> 10 10-14 Incheon  2011      0 83927        29.3 Medium    FALSE         
#> # ℹ 125 more rows
#> # ℹ 1 more variable: is_dragon_year_age <fct>We create a new model with the age-sepcific dragon-year indicator.
mod_dragon_age <- mod_pois(births ~ (age + region + time)^2,
                         data = births,
                         exposure = popn) |>
  set_covariates(~ is_dragon_year_age) |>
  fit()
#> Building log-posterior function...
#> Finding maximum...
#> Drawing values for hyper-parameters...
mod_dragon_age
#> 
#>     ------ Fitted Poisson model ------
#> 
#>    births ~ (age + region + time)^2
#> 
#>                  exposure: popn
#> 
#>         term  prior along n_par n_par_free std_dev
#>  (Intercept) NFix()     -     1          1       -
#>          age   RW()   age     9          9    2.71
#>       region    N()     -    16         16    0.16
#>         time   RW()  time    13         13    0.18
#>   age:region   RW()   age   144        144    0.23
#>     age:time   RW()  time   117        117    1.16
#>  region:time   RW()  time   208        208    0.22
#> 
#>  covariates: ~is_dragon_year_age
#> 
#>  disp: mean = 1
#> 
#>  n_draw var_time var_age optimizer
#>    1000     time     age    nlminb
#> 
#>  time_total time_max time_draw iter converged                    message
#>        0.35     0.17      0.14   22      TRUE   relative convergence (4)Rather than a single dragon-year coefficient, we have a coefficient for each age group. We extract them and tidy up the labels.
mod_dragon_age |>
  components() |>
  filter(term == "covariates") |>
  mutate(age = sub("is_dragon_year_age", "", level)) |>
  select(age, .fitted)
#> # A tibble: 9 × 2
#>   age               .fitted
#>   <chr>        <rdbl<1000>>
#> 1 10-14  0.59 (-0.053, 1.2)
#> 2 15-19 -0.011 (-0.22, 0.2)
#> 3 20-24 0.043 (-0.14, 0.24)
#> 4 25-29 0.064 (-0.12, 0.25)
#> 5 30-34 0.085 (-0.12, 0.27)
#> 6 35-39 0.048 (-0.15, 0.23)
#> 7 40-44 0.075 (-0.12, 0.26)
#> 8 45-49 0.096 (-0.17, 0.37)
#> 9 50-54  0.33 (-0.25, 0.86)If all the covariates in a model are fixed, then the model can be forecasted as normal.
mod_gdp_dens |>
  forecast(labels = 2024:2025)
#> `components()` for past values...
#> `components()` for future values...
#> `augment()` for future values...
#> # A tibble: 90 × 10
#>    age   region   time births  popn gdp_pc_2023 dens_2020 .observed
#>    <chr> <fct>   <int>  <dbl> <int>       <dbl> <chr>         <dbl>
#>  1 10-14 Busan    2024     NA    NA        25.7 High             NA
#>  2 10-14 Busan    2025     NA    NA        25.7 High             NA
#>  3 10-14 Daegu    2024     NA    NA        22.3 Medium           NA
#>  4 10-14 Daegu    2025     NA    NA        22.3 Medium           NA
#>  5 10-14 Gwangju  2024     NA    NA        26.0 Medium           NA
#>  6 10-14 Gwangju  2025     NA    NA        26.0 Medium           NA
#>  7 10-14 Incheon  2024     NA    NA        29.3 Medium           NA
#>  8 10-14 Incheon  2025     NA    NA        29.3 Medium           NA
#>  9 10-14 Seoul    2024     NA    NA        43.4 High             NA
#> 10 10-14 Seoul    2025     NA    NA        43.4 High             NA
#> # ℹ 80 more rows
#> # ℹ 2 more variables: .fitted <rdbl<1000>>, .expected <rdbl<1000>>If, however, a covariate varies over time, forecasting only works if values for future periods are provided. The following code will result in an error:
mod_dragon |>
  forecast(labels = 2024:2025)Instead we need to create a newdata data frame…
newdata <- expand.grid(age = unique(kor_births$age),
                       region = unique(kor_births$region),
                       time = 2024:2025) |>
  mutate(is_dragon_year = FALSE)
head(newdata)
#>     age region time is_dragon_year
#> 1 10-14  Busan 2024          FALSE
#> 2 15-19  Busan 2024          FALSE
#> 3 20-24  Busan 2024          FALSE
#> 4 25-29  Busan 2024          FALSE
#> 5 30-34  Busan 2024          FALSE
#> 6 35-39  Busan 2024          FALSE…and supply it to forecast().
mod_dragon |>
  forecast(newdata = newdata)
#> `components()` for past values...
#> `components()` for future values...
#> `augment()` for future values...
#> # A tibble: 288 × 11
#>    age   region           time births  popn gdp_pc_2023 dens_2020 is_dragon_year
#>    <chr> <fct>           <int>  <dbl> <int>       <dbl> <chr>     <lgl>         
#>  1 10-14 Busan            2024     NA    NA          NA <NA>      FALSE         
#>  2 15-19 Busan            2024     NA    NA          NA <NA>      FALSE         
#>  3 20-24 Busan            2024     NA    NA          NA <NA>      FALSE         
#>  4 25-29 Busan            2024     NA    NA          NA <NA>      FALSE         
#>  5 30-34 Busan            2024     NA    NA          NA <NA>      FALSE         
#>  6 35-39 Busan            2024     NA    NA          NA <NA>      FALSE         
#>  7 40-44 Busan            2024     NA    NA          NA <NA>      FALSE         
#>  8 45-49 Busan            2024     NA    NA          NA <NA>      FALSE         
#>  9 50-54 Busan            2024     NA    NA          NA <NA>      FALSE         
#> 10 10-14 Chungcheongbuk…  2024     NA    NA          NA <NA>      FALSE         
#> # ℹ 278 more rows
#> # ℹ 3 more variables: .observed <dbl>, .fitted <rdbl<1000>>,
#> #   .expected <rdbl<1000>>