01: Understanding and Preparing Your Event Data

Overview

The event and population data are at the core of the BYM-based models used in the RSTr package. They work alongside the adjacency information to generate smoothed estimates. In this vignette, we’ll discuss requirements for event and population data and walk through an example with a data.frame.

Requirements

Example: CDC WONDER dataset

To walk through the data setup from a data.frame to the final array list, we will use data generated by CDC WONDER’s Underlying Cause of Death Compressed Mortality, ICD-9 database, found at https://wonder.cdc.gov/cmf-icd9.html:

head(maexample)
#>   Notes Year Year.Code                County County.Code    Sex Sex.Code Deaths
#> 1       1979      1979 Barnstable County, MA       25001 Female        F     15
#> 2       1979      1979 Barnstable County, MA       25001   Male        M     57
#> 3       1979      1979  Berkshire County, MA       25003 Female        F     11
#> 4       1979      1979  Berkshire County, MA       25003   Male        M     63
#> 5       1979      1979    Bristol County, MA       25005 Female        F     52
#> 6       1979      1979    Bristol County, MA       25005   Male        M    191
#>   Population        Crude.Rate
#> 1      25239 59.4 (Unreliable)
#> 2      21261             268.1
#> 3      24884 44.2 (Unreliable)
#> 4      22465             280.4
#> 5      80171              64.9
#> 6      71943             265.5

Our example dataset contains acute myocardial infarction (ICD-9: 410) mortality and population data in all counties of Massachusetts for men and women aged 35-64 from 1979 to 1981. This dataset also includes some notes in the bottom rows describing the dataset. As the output above shows, maexample contains the variables Notes, Year, Year.Code, County, County.Code, Sex, Sex.Code, Deaths, Population, and Crude.Rate.

The first thing we want to do with our dataset is remove the notes from the bottom rows. While they are useful for getting acquainted with the dataset, they would otherwise end up in our event and population arrays. Since Year does not have information in rows with notes, we can use that to filter our data:

ma_mort <- maexample[which(!is.na(maexample$Year)), ]

The above code searches for values in maexample$Year that aren’t NA and creates a new dataset containing only those rows. Before we start generating our arrays, let’s take stock of how our data is listed out:

head(ma_mort)
#>   Notes Year Year.Code                County County.Code    Sex Sex.Code Deaths
#> 1       1979      1979 Barnstable County, MA       25001 Female        F     15
#> 2       1979      1979 Barnstable County, MA       25001   Male        M     57
#> 3       1979      1979  Berkshire County, MA       25003 Female        F     11
#> 4       1979      1979  Berkshire County, MA       25003   Male        M     63
#> 5       1979      1979    Bristol County, MA       25005 Female        F     52
#> 6       1979      1979    Bristol County, MA       25005   Male        M    191
#>   Population        Crude.Rate
#> 1      25239 59.4 (Unreliable)
#> 2      21261             268.1
#> 3      24884 44.2 (Unreliable)
#> 4      22465             280.4
#> 5      80171              64.9
#> 6      71943             265.5
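
As a quick sanity check on the filtering step, the sketch below applies the same idea to a small made-up table. It assumes, as described above, that notes rows have NA in Year; the toy and toy_clean objects and all values in them are hypothetical and not part of maexample.

```r
# Toy stand-in for a CDC WONDER export: two data rows plus two trailing notes rows.
# All values here are made up for illustration.
toy <- data.frame(
  Notes  = c("", "", "---", "Dataset: Compressed Mortality"),
  Year   = c(1979, 1979, NA, NA),
  Deaths = c(15, 57, NA, NA)
)

# Keep only rows where Year is present, dropping the notes rows
toy_clean <- toy[!is.na(toy$Year), ]

nrow(toy)        # 4 rows before filtering
nrow(toy_clean)  # 2 rows after filtering
```

Comparing nrow() before and after filtering is an easy way to confirm that only the notes rows were dropped.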

RSTr offers a long_to_list_matrix() function which can transform this dataset into mortality and population arrays with properly oriented margins:

ma_data <- long_to_list_matrix(ma_mort, Deaths, Population, County.Code, Sex.Code, Year.Code)

If you want to manually set up the data, you can create Y and n arrays using the xtabs() function and consolidate them into a list to be used with the model:

Y <- xtabs(Deaths ~ County.Code + Sex.Code + Year.Code, data = ma_mort)
n <- xtabs(Population ~ County.Code + Sex.Code + Year.Code, data = ma_mort)
ma_data <- list(Y = Y, n = n)

Note that you must name each list element as above. Calling list(Y, n) with just the objects produces an unnamed list, and RSTr relies on the names Y and n to know how to use the data.
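
To make the naming point concrete, here is a minimal sketch using small made-up arrays in place of the real Y and n. The unnamed and named objects are hypothetical names used only for this illustration.

```r
# Small made-up arrays standing in for the Y and n built with xtabs()
Y <- array(1:4, dim = c(2, 2))
n <- array(101:104, dim = c(2, 2))

unnamed <- list(Y, n)          # elements have no names
named   <- list(Y = Y, n = n)  # elements are named Y and n

names(unnamed)  # NULL
names(named)    # "Y" "n"

# Elements of the unnamed list cannot be retrieved by name
is.null(unnamed$Y)  # TRUE
```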

If you have multiple types of groups, such as race and sex, it can take a little finessing to set up your group data, such as creating a combined race-sex group variable, but data setup will follow the same principles as above.
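
One possible way to build such a combined group variable is sketched below with made-up data. The Race column, the Race.Sex variable, and all counts are hypothetical (the example dataset only contains sex), and paste() is just one option for combining the two variables.

```r
# Made-up long-format data with two grouping variables.
# The Race column is hypothetical; the example dataset only contains Sex.
toy <- expand.grid(
  County.Code = c("25001", "25003"),
  Race        = c("Black", "White"),
  Sex.Code    = c("F", "M"),
  Year.Code   = c(1979, 1980),
  stringsAsFactors = FALSE
)
toy$Deaths     <- seq_len(nrow(toy))
toy$Population <- 1000 + seq_len(nrow(toy))

# Collapse race and sex into a single group variable
toy$Race.Sex <- paste(toy$Race, toy$Sex.Code, sep = "_")

# Build the arrays with the combined variable as the group margin
Y <- xtabs(Deaths ~ County.Code + Race.Sex + Year.Code, data = toy)
n <- xtabs(Population ~ County.Code + Race.Sex + Year.Code, data = toy)
toy_data <- list(Y = Y, n = n)

dim(toy_data$Y)  # 2 counties x 4 race-sex groups x 2 years
```

The combined variable takes the place of Sex.Code in the group margin, so the rest of the setup is unchanged.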

Data setup for other models

The above dataset is prepared specifically for an MSTCAR model. But what if we want to prepare data for an MCAR or even a CAR model? We can filter the original dataset and follow a similar procedure to prepare our data for the MCAR model:

ma_mort_mcar <- ma_mort[ma_mort$Year == 1979, ] # filter dataset to only show 1979 data
ma_data_mcar <- long_to_list_matrix(ma_mort_mcar, Deaths, Population, County.Code, Sex.Code)

Note that xtabs() aggregates data along the variables specified in its formula argument, summing over any variable that is left out. For the MCAR model, we therefore filter down to the year we want first; otherwise, the result would contain mortality and population counts summed across all years in our dataset instead of just those for 1979.
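
The aggregation behavior of xtabs() can be seen in a small sketch with made-up counts for a single county and sex. The toy data frame and its values are hypothetical.

```r
# Made-up counts for one county and one sex across three years
toy <- data.frame(
  County.Code = "25001",
  Sex.Code    = "F",
  Year.Code   = c(1979, 1980, 1981),
  Deaths      = c(15, 12, 18)
)

# Without filtering, leaving Year.Code out of the formula sums over all years
Y_all <- xtabs(Deaths ~ County.Code + Sex.Code, data = toy)
Y_all["25001", "F"]   # 45

# Filtering first keeps only the 1979 count
toy_1979 <- toy[toy$Year.Code == 1979, ]
Y_1979 <- xtabs(Deaths ~ County.Code + Sex.Code, data = toy_1979)
Y_1979["25001", "F"]  # 15
```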

For the CAR model, setup is similar:

ma_mort_car <- ma_mort[ma_mort$Year == 1979 & ma_mort$Sex == "Male", ] # filter dataset to only show 1979 data for men
ma_data_car <- long_to_list_matrix(ma_mort_car, Deaths, Population, County.Code)

Closing Thoughts

In this vignette, we used data generated from CDC WONDER to construct our event and population counts, removed unnecessary rows by subsetting on non-missing values of Year, and constructed our list using long_to_list_matrix(). Setting up the data for RSTr can seem daunting at first, but with a few quick tricks in R, it can be easy to have your data organized for analysis.