library(rater)
#> * The rater package uses `Stan` to fit bayesian models.
#> * If you are working on a local, multicore CPU with excess RAM please call:
#> * options(mc.cores = parallel::detectCores())
#> * This will allow Stan to run inference on multiple cores in parallel.
{rater} allows the user to fit statistical models to repeated
categorical rating data. There are, however, different formats this
rating data can be arranged in. {rater} supports three of the most
common, which we call "long"
, "wide"
and
"grouped"
. This vignette explains:
The default format for data passed to the rater()
function is "long"
. This is the default format because it
is capable of representing rating data with incomplete designs (not
every rater rates each item) and with repeated ratings (raters ratings
items more than once). Long data is defined by having three columns:
in {rater}, for this data to be recognised, these columns must have
names "rater"
, "item"
and
"rating"
respectively. An example of long data in the
appropriate format for {rater} is given illustrated below:
item | rater | rating |
---|---|---|
1 | 1 | 3 |
1 | 2 | 4 |
2 | 1 | 2 |
2 | 2 | 2 |
3 | 1 | 2 |
3 | 2 | 2 |
We can read the first row, for example, of the dataset as saying that
the item 1 was rated a 3 by rater 1. Repeated ratings are represented by
rows with the same rater and item (and possibly rating) combination and
missing data is represented simply by the absence of certain rater and
item combinations. Long data is useful because it can represent
all possible categorical rating data. The
anesthesia
data included with {rater} (which includes
repeated ratings) is represented in this format.
To illustrate the differences between formats, the data used in the long data example will be used also presented in the wide and grouped formats below.
The next data format which can be used in rater()
is the
wide format. In wide format data each column corresponds to the ratings
of a particular rater, each row is an item and the entries of the
corresponding table are the ratings themselves. For example the
following table presents the previous long data example in wide
format:
rater_1 | rater_2 |
---|---|
3 | 4 |
2 | 2 |
2 | 2 |
This format is natural if there are no repeated ratings, which cannot
be represented in this format. Missing data can be represented by
explicit NA
entries in the data. In rater()
this format can be used by setting data_format = "wide"
.
Internally this simply converts the data to long format; there is no
computational advantage to using wide data.
Note that when wide data is passed to rater()
any column
names (i.e. rater_1
and rater_2
above) will be
ignored and the raters will be numbered as they appear in the data left
to right. In future ‘labelling’ of the rater may be supported in which
case the columns names will be interpreted as the labels.
The final format of data supported by {rater} is the grouped format. This format can be thought of as an extension of the wide format where rating ‘patterns’ which occurs multiple times are collapsed together, while a new column is added to represent how many times each pattern occurred in the original data. For example in the running data example the pattern of both raters giving the rating 2 occurs twice, while the pattern of rater 1 giving the rating a 3 and rater 2 giving the rating 4 occurs once. This is illustrated in the grouped data representation of the example data below:
rater_1 | rater_2 | n |
---|---|---|
3 | 4 | 1 |
2 | 2 | 2 |
Here the column n
represents the number of times each
pattern occurs. rater()
requires that a column named
n
is the right most column in grouped data and will
interpret the remaining columns as for wide data. Currently grouped data
passed to rater()
cannot contain any missing values. The
caries
data included in the package is in the {rater}
grouped data format.
The grouped format can only represent the same data as wide format, but it is still useful. This is because using grouped data allows a different from of the likelihood of the statistical models implemented in rater to be used, which can greatly speed up model fitting. If the number of patterns in the data is much less than the number of item by rater combinations (i.e. the number of rows in the long format) then using grouped data can lead to large speed-ups. Currently this re-writing of the likelihood is only available for the Dawid-Skene model, not any of the extensions implemented in the package.