Abstract
In this vignette, we learn how to evaluate predictions on the ID level with evaluate().
Contact the author at r-pkgs@ludvigolsen.dk
When we have groups of observations (e.g. a participant ID), we are sometimes more interested in the overall prediction for the group than those at the observation-level.
Say we have a dataset with 10 observations per participant and a model that predicts whether a participant has an autism diagnosis or not. While the model will predict each of the 10 observations, it’s really the overall prediction for the participant that we are interested in.
evaluate() has two approaches to performing the evaluation on the ID level: averaging and voting.
In averaging, we simply average the predicted probabilities for the participant. This is the default approach, as it maintains information about how certain our model is in its class prediction. That is, if all of a participant's observations have a 60% predicted probability of an autism diagnosis, that should be treated differently than if they all have a 90% predicted probability.
In voting, we simply count the predictions of each outcome class and assign the class with the most predictions to the participant. If 7 out of 10 observations are predicted as having no autism diagnosis, that becomes the prediction for the participant.
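Purely to illustrate the arithmetic behind the two approaches (this is a sketch, not how evaluate() is implemented internally), we can aggregate a small toy set of observation-level probabilities by hand with dplyr:
library(dplyr)
# Toy predictions: three observations for each of two participants
toy <- tibble(
  id = c(1, 1, 1, 2, 2, 2),
  probability = c(0.60, 0.55, 0.70, 0.80, 0.30, 0.20)
)
toy %>%
  group_by(id) %>%
  summarise(
    # Averaging: mean probability per ID, then threshold at 0.5
    mean_probability = mean(probability),
    averaged_class = as.integer(mean_probability > 0.5),
    # Voting: class 1 wins if more than half the observations predict it
    majority_class = as.integer(mean(probability > 0.5) > 0.5)
  )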
We will use the simple participant.scores dataset, as it has 3 rows per participant and a diagnosis column that we can evaluate predictions against. Let's add predicted probabilities and diagnoses and have a look:
library(cvms)
library(knitr) # kable()
library(dplyr)
set.seed(74)
# Prepare dataset
data <- participant.scores %>% as_tibble()
# Add probabilities and predicted classes
data[["probability"]] <- runif(nrow(data))
data[["predicted diagnosis"]] <- ifelse(data[["probability"]] > 0.5, 1, 0)
data %>% head(10) %>% kable()
participant | age | diagnosis | score | session | probability | predicted diagnosis |
---|---|---|---|---|---|---|
1 | 20 | 1 | 10 | 1 | 0.7046162 | 1 |
1 | 20 | 1 | 24 | 2 | 0.4800045 | 0 |
1 | 20 | 1 | 45 | 3 | 0.1960176 | 0 |
2 | 23 | 0 | 24 | 1 | 0.9369707 | 1 |
2 | 23 | 0 | 40 | 2 | 0.8698302 | 1 |
2 | 23 | 0 | 67 | 3 | 0.2140318 | 0 |
3 | 27 | 1 | 15 | 1 | 0.0240853 | 0 |
3 | 27 | 1 | 30 | 2 | 0.8547959 | 1 |
3 | 27 | 1 | 40 | 3 | 0.7027153 | 1 |
4 | 21 | 0 | 35 | 1 | 0.9579817 | 1 |
We tell evaluate() to aggregate the predictions by the participant column with the mean (averaging) method.
Note: It is assumed that the target class is constant within the IDs, i.e. that a participant has the same diagnosis in all of their observations.
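If you are unsure whether this holds for your data, a quick check along these lines can confirm it before evaluating (a small dplyr sketch; evaluate() does not require this step):
# All participants should have exactly one distinct diagnosis;
# any rows returned here would violate the assumption
data %>%
  group_by(participant) %>%
  summarise(n_diagnoses = n_distinct(diagnosis)) %>%
  filter(n_diagnoses > 1)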
ev <- evaluate(
data = data,
target_col = "diagnosis",
prediction_cols = "probability",
id_col = "participant",
id_method = "mean",
type = "binomial"
)
ev
#> # A tibble: 1 × 19
#> `Balanced Accuracy` Accuracy F1 Sensitivity Specificity `Pos Pred Value`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.292 0.3 0.364 0.333 0.25 0.4
#> # ℹ 13 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower CI` <dbl>,
#> # `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>,
#> # `Detection Prevalence` <dbl>, Prevalence <dbl>, Predictions <list>,
#> # ROC <named list>, `Confusion Matrix` <list>, Process <list>
The Predictions column contains the averaged predictions:
Target | Prediction | SD | Predicted Class | participant | id_method |
---|---|---|---|---|---|
1 | 0.4602128 | 0.2548762 | 0 | 1 | mean |
0 | 0.6736109 | 0.3994204 | 1 | 2 | mean |
1 | 0.5271988 | 0.4422946 | 1 | 3 | mean |
0 | 0.7974576 | 0.1703448 | 1 | 4 | mean |
1 | 0.5887699 | 0.4738221 | 1 | 5 | mean |
1 | 0.3526630 | 0.2302525 | 0 | 6 | mean |
1 | 0.2333758 | 0.1913763 | 0 | 7 | mean |
1 | 0.3956015 | 0.3207379 | 0 | 8 | mean |
0 | 0.3374361 | 0.0304785 | 0 | 9 | mean |
0 | 0.5988969 | 0.0675830 | 1 | 10 | mean |
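To inspect these ID-level predictions ourselves, we can pull them out of the Predictions list column (a quick sketch; since ev has a single row, we take the first element):
# Extract the nested tibble of ID-level predictions
id_predictions <- ev$Predictions[[1]]
id_predictions %>% kable()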
Let’s plot the confusion matrix as well:
# Note: If ev had multiple rows, we would have to
# pass ev$`Confusion Matrix`[[1]] to
# plot the first row's confusion matrix
plot_confusion_matrix(ev)
Let's take a closer look at the metrics:
Balanced Accuracy | Accuracy | F1 | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | AUC | Lower CI |
---|---|---|---|---|---|---|---|---|
0.29167 | 0.3 | 0.36364 | 0.33333 | 0.25 | 0.4 | 0.2 | 0.20833 | 0 |
Upper CI | Kappa | MCC | Detection Rate | Detection Prevalence |
---|---|---|---|---|
0.62475 | -0.4 | -0.40825 | 0.2 | 0.5 |
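Tables like the two above can be created by selecting subsets of the metric columns and passing them to kable(); a sketch (the column ranges are chosen purely for display width):
ev %>%
  select(`Balanced Accuracy`:`Lower CI`) %>%
  kable(digits = 5)
ev %>%
  select(`Upper CI`:`Detection Prevalence`) %>%
  kable(digits = 5)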
We can use the majority (voting) method for the ID aggregation instead:
ev_2 <- evaluate(
data = data,
target_col = "diagnosis",
prediction_cols = "probability",
id_col = "participant",
id_method = "majority",
type = "binomial"
)
ev_2
#> # A tibble: 1 × 19
#> `Balanced Accuracy` Accuracy F1 Sensitivity Specificity `Pos Pred Value`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.292 0.3 0.364 0.333 0.25 0.4
#> # ℹ 13 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower CI` <dbl>,
#> # `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>,
#> # `Detection Prevalence` <dbl>, Prevalence <dbl>, Predictions <list>,
#> # ROC <named list>, `Confusion Matrix` <list>, Process <list>
Now the Predictions column looks as follows:
Target | Prediction | Predicted Class | participant | id_method |
---|---|---|---|---|
1 | 0 | 0 | 1 | majority |
0 | 1 | 1 | 2 | majority |
1 | 1 | 1 | 3 | majority |
0 | 1 | 1 | 4 | majority |
1 | 1 | 1 | 5 | majority |
1 | 0 | 0 | 6 | majority |
1 | 0 | 0 | 7 | majority |
1 | 0 | 0 | 8 | majority |
0 | 0 | 0 | 9 | majority |
0 | 1 | 1 | 10 | majority |
In this case, the Predicted Class column is identical to that from the averaging approach. We just don't have the probabilities to tell us how sure the model is about each prediction.
If we have predictions from multiple models, we can group the data frame and get the results per model.
Let's duplicate the dataset and change the predictions. We then combine the two datasets and add a model column indicating which data frame each observation came from:
# Duplicate data frame
data_2 <- data
# Change the probabilities and predicted classes
data_2[["probability"]] <- runif(nrow(data))
data_2[["predicted diagnosis"]] <- ifelse(data_2[["probability"]] > 0.5, 1, 0)
# Combine the two data frames
data_multi <- dplyr::bind_rows(data, data_2, .id = "model")
data_multi
#> # A tibble: 60 × 8
#> model participant age diagnosis score session probability
#> <chr> <fct> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 1 1 20 1 10 1 0.705
#> 2 1 1 20 1 24 2 0.480
#> 3 1 1 20 1 45 3 0.196
#> 4 1 2 23 0 24 1 0.937
#> 5 1 2 23 0 40 2 0.870
#> 6 1 2 23 0 67 3 0.214
#> 7 1 3 27 1 15 1 0.0241
#> 8 1 3 27 1 30 2 0.855
#> 9 1 3 27 1 40 3 0.703
#> 10 1 4 21 0 35 1 0.958
#> # ℹ 50 more rows
#> # ℹ 1 more variable: `predicted diagnosis` <dbl>
We can now group the data frame by the model column and run the evaluation again:
ev_3 <- data_multi %>%
dplyr::group_by(model) %>%
evaluate(
target_col = "diagnosis",
prediction_cols = "probability",
id_col = "participant",
id_method = "mean",
type = "binomial"
)
ev_3
#> # A tibble: 2 × 20
#> model `Balanced Accuracy` Accuracy F1 Sensitivity Specificity
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.292 0.3 0.364 0.333 0.25
#> 2 2 0.375 0.4 0.5 0.5 0.25
#> # ℹ 14 more variables: `Pos Pred Value` <dbl>, `Neg Pred Value` <dbl>,
#> # AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>,
#> # `Detection Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>,
#> # Predictions <list>, ROC <named list>, `Confusion Matrix` <list>,
#> # Process <list>
The Predictions for the second model look as follows:
model | Target | Prediction | SD | Predicted Class | participant | id_method |
---|---|---|---|---|---|---|
2 | 1 | 0.3302017 | 0.3002763 | 0 | 1 | mean |
2 | 0 | 0.6040242 | 0.2854935 | 1 | 2 | mean |
2 | 1 | 0.7342651 | 0.2653166 | 1 | 3 | mean |
2 | 0 | 0.6383918 | 0.3799305 | 1 | 4 | mean |
2 | 1 | 0.4551732 | 0.3417810 | 0 | 5 | mean |
2 | 1 | 0.6808281 | 0.3626166 | 1 | 6 | mean |
2 | 1 | 0.4536740 | 0.3784584 | 0 | 7 | mean |
2 | 1 | 0.6281501 | 0.4506029 | 1 | 8 | mean |
2 | 0 | 0.7000411 | 0.1490745 | 1 | 9 | mean |
2 | 0 | 0.4630344 | 0.4344227 | 0 | 10 | mean |
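Since ev_3 has one row per model, the nested predictions above can be extracted by indexing the Predictions list column with the relevant row (a sketch):
# The second row of ev_3 holds the evaluation of model 2
ev_3$Predictions[[2]] %>% kable()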
'gaussian' evaluation
This kind of ID aggregation is also available for the 'gaussian' evaluation (e.g. for linear regression models), although only with the averaging approach. Again, it is assumed that the target value is constant across all of a participant's observations (like the age column in our dataset).
We add a predicted age column to our initial dataset:
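The chunk that added these predictions is not shown here; as a hypothetical stand-in, one could simulate them, e.g. with runif() (simulated values will not reproduce the exact numbers in the output below):
# Hypothetical stand-in: simulate predicted ages in a plausible range
data[["predicted age"]] <- round(runif(nrow(data), min = 20, max = 45))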
We evaluate the predicted age, aggregated by participant:
ev_4 <- evaluate(
data = data,
target_col = "age",
prediction_cols = "predicted age",
id_col = "participant",
id_method = "mean",
type = "gaussian"
)
ev_4
#> # A tibble: 1 × 8
#> RMSE MAE `NRMSE(IQR)` RRSE RAE RMSLE Predictions Process
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
#> 1 10.3 8.7 0.984 1.48 1.45 0.340 <tibble [10 × 5]> <prcss_n_>
The Predictions column looks as follows:
Target | Prediction | SD | participant | id_method |
---|---|---|---|---|
20 | 35.66667 | 8.326664 | 1 | mean |
23 | 33.33333 | 10.214369 | 2 | mean |
27 | 35.33333 | 5.686241 | 3 | mean |
21 | 30.00000 | 4.582576 | 4 | mean |
32 | 28.66667 | 5.507570 | 5 | mean |
31 | 43.33333 | 1.154700 | 6 | mean |
43 | 39.00000 | 5.196152 | 7 | mean |
21 | 40.33333 | 2.516611 | 8 | mean |
34 | 35.33333 | 5.507570 | 9 | mean |
32 | 35.33333 | 7.571878 | 10 | mean |
On average, we predict participant 1 to have an age of 35.67.
If our targets are not constant within the IDs, we might instead be interested in evaluating the predictions per ID, e.g. how well the model predicted the score for each of the participants.
We add a predicted score column to our dataset:
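As before, the chunk that added these predictions is not shown; a hypothetical simulated stand-in could look like this (simulated values will not reproduce the exact numbers below):
# Hypothetical stand-in: simulate predicted scores in a plausible range
data[["predicted score"]] <- round(runif(nrow(data), min = 10, max = 80))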
Now, we group the data frame by the participant column and evaluate the predicted scores:
data %>%
dplyr::group_by(participant) %>%
evaluate(
target_col = "score",
prediction_cols = "predicted score",
type = "gaussian"
)
#> # A tibble: 10 × 9
#> participant RMSE MAE `NRMSE(IQR)` RRSE RAE RMSLE Predictions Process
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
#> 1 1 13.8 13.7 0.787 0.957 1.10 0.683 <tibble> <prcss_n_>
#> 2 2 32.4 26.3 1.50 1.82 1.69 0.946 <tibble> <prcss_n_>
#> 3 3 12.8 10.7 1.03 1.25 1.2 0.549 <tibble> <prcss_n_>
#> 4 4 9.15 7.67 0.425 0.513 0.486 0.154 <tibble> <prcss_n_>
#> 5 5 24.1 17.3 1.27 1.47 1.15 0.566 <tibble> <prcss_n_>
#> 6 6 34.2 33.3 4.27 5.12 5.56 0.895 <tibble> <prcss_n_>
#> 7 7 44.7 40 2.98 3.45 3.33 1.21 <tibble> <prcss_n_>
#> 8 8 9.80 8 0.700 0.854 0.818 0.306 <tibble> <prcss_n_>
#> 9 9 22.6 21.3 1.37 1.66 1.81 0.447 <tibble> <prcss_n_>
#> 10 10 29.3 28 1.13 1.38 1.62 0.556 <tibble> <prcss_n_>
Participant 4 has the lowest prediction error, while participant 7 has the highest.
This approach is similar to what the most_challenging() function does:
# Extract the ~20% observations with highest prediction error
most_challenging(
data = data,
type = "gaussian",
obs_id_col = "participant",
target_col = "score",
prediction_cols = "predicted score",
threshold = 0.20
)
#> # A tibble: 2 × 4
#> participant MAE RMSE `>=`
#> <fct> <dbl> <dbl> <dbl>
#> 1 7 40 44.7 32.7
#> 2 6 33.3 34.2 32.7
This concludes the vignette. If any elements are unclear, you can leave feedback in an email or in a GitHub issue :-)