Introduction to cito

Christian Amesoeder & Maximilian Pichler

2023-09-29

Abstract

‘cito’ allows you to build and train neural networks using the R formula syntax. It relies on the ‘torch’ package for numerical computations and optional graphics card (GPU) support.

Setup - Installing torch

Before using ‘cito’ make sure that the current version of ‘torch’ is installed and running.

if(!require(torch)) install.packages("torch")
#> Loading required package: torch
library(torch)
if(!torch_is_installed()) install_torch()

library(cito)

If you have problems installing ‘torch’, check out the installation help from the torch developers.

Introduction to models and model structures

Loss functions / Likelihoods

‘cito’ can handle many different response types. Both common loss functions from machine learning and likelihoods for statistical models are supported:

| Name | Explanation | Example / Task | Meta-Code |
|---|---|---|---|
| mse | mean squared error | Regression, predicting continuous values | dnn(Sepal.Length~., ..., loss = "mse") |
| mae | mean absolute error | Regression, predicting continuous values | dnn(Sepal.Length~., ..., loss = "mae") |
| softmax | categorical cross entropy | Multi-class, species classification | dnn(Species~., ..., loss = "softmax") |
| cross-entropy | categorical cross entropy | Multi-class, species classification | dnn(Species~., ..., loss = "cross-entropy") |
| gaussian | Normal likelihood | Regression, residual error is also estimated (similar to stats::lm()) | dnn(Sepal.Length~., ..., loss = "gaussian") |
| binomial | Binomial likelihood | Classification/logistic regression, e.g. mortality (0/1 data) | dnn(Presence~., ..., loss = "binomial") |
| poisson | Poisson likelihood | Regression on count data, e.g. species abundances | dnn(Abundance~., ..., loss = "poisson") |

Moreover, all losses except the multi-class ones (softmax and cross-entropy) support multiple responses (multi-output models), which are specified with the cbind syntax:

dnn(cbind(Sepal.Length, Sepal.Width)~., …, loss = "mse")

The likelihoods (Gaussian, Binomial, and Poisson) can be also passed as their stats equivalents:

dnn(Sepal.Length~., ..., loss = stats::gaussian)

Data

In this vignette, we will work with the iris dataset and build a regression model.

data <- datasets::iris
head(data)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

#scale dataset
data <- data.frame(scale(data[,-5]),Species = data[,5])

Fitting a simple model

In ‘cito’, neural networks are specified and fitted with the dnn() function. Models can also be trained on the GPU by setting device = "cuda" (but only if you have installed the CUDA dependencies). This is recommended if you are working with large data sets or networks.

library(cito)

#fitting a regression model to predict Sepal.Length
nn.fit <- dnn(Sepal.Length~. , data = data, epochs = 12, loss = "mse", verbose=FALSE)
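
If a CUDA-capable GPU is available, the same model can be fitted on it via the device argument. A minimal sketch, using torch’s cuda_is_available() to check for a GPU first:

# fall back to the CPU if torch does not find a CUDA device
device <- if (torch::cuda_is_available()) "cuda" else "cpu"
nn.fit.gpu <- dnn(Sepal.Length~. , data = data, epochs = 12,
                  loss = "mse", device = device, verbose = FALSE)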

You can plot the network structure to get visual feedback on the created object. Be aware that this may take some time for large networks.

plot(nn.fit)
(Figure: visualization of the network structure)

The neural network has 5 input nodes (3 continuous features: Sepal.Width, Petal.Length, and Petal.Width, plus the n_classes - 1 = 2 contrast columns of the Species variable) and 1 output node for the response (Sepal.Length).

Baseline loss

At the start of the training, we calculate a baseline loss for an intercept-only model. It allows us to monitor the training because the goal is to beat the baseline loss. If we don’t, we need to adjust the optimization parameters (epochs and lr (learning rate)):

nn.fit <- dnn(Sepal.Length~. , data = data, epochs = 50, lr = 0.6, loss = "mse", verbose = FALSE) # lr too high
#> Error in eval_bare(loop, env): Loss is NA. Bad training, please check learning rate or regularization strength. See vignette('02_Troubleshooting') for help.

vs

nn.fit <- dnn(Sepal.Length~. , data = data, epochs = 50, lr = 0.01, loss = "mse", verbose = FALSE)
(Figure: training loss plot)

See vignette("B-Training_neural_networks") for more details on how to adjust the optimization procedure and increase the probability of convergence.

Adding a validation set to the training process

To see where your model might suffer from overfitting, adding a validation set can be useful. With dnn() you can set validation = 0.x to define the fraction of the data that will not be used for training but only for validation after each epoch. During training, a loss plot will show you how the two losses behave (see vignette("B-Training_neural_networks") for details on training NNs and guaranteeing their convergence).

#20% of data set is used as validation set
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 32,
              loss= "mse", validation = 0.2)

The weights of the last epoch and of the epoch with the lowest validation loss are saved:

length(nn.fit$weights)
#> [1] 2

The default is to use the weights of the last epoch. But we can also tell the model to use the weights with the lowest validation loss:

nn.fit$use_model_epoch = 1 # Default, use last epoch
nn.fit$use_model_epoch = 2 # Use weights from epoch with lowest validation loss
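
Downstream methods (such as predict(), see the next section) should then use the selected weights, e.g.:

nn.fit$use_model_epoch = 2 # weights from the epoch with the lowest validation loss
head(predict(nn.fit))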

Methods

cito supports many of the well-known methods from other statistical packages:

predict(nn.fit)
coef(nn.fit)
print(nn.fit)
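
predict() should also accept new observations via the usual newdata argument, as in most R modelling packages; a small sketch:

# predictions for the first three rows of the (scaled) iris data
predict(nn.fit, newdata = data[1:3, ])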

Explainable AI - Understanding your model

xAI can produce outputs that are similar to known outputs from statistical models:

| xAI Method | Explanation | Statistical equivalent |
|---|---|---|
| Feature importance (returned by summary(nn.fit)) | Feature importance based on permutations (see Fisher, Rudin, and Dominici (2018)); similar to how much variance in the data is explained by the features. | anova(model) |
| Average conditional effects (ACE) (returned by summary(nn.fit)) | Average of local derivatives, an approximation of linear effects (see Pichler and Hartig (2023)). | lm(Y~X) |
| Standard deviation of conditional effects (SDCE) (returned by summary(nn.fit)) | Standard deviation of the local conditional effects; correlates with the non-linearity of the effects. | |
| Partial dependency plots (PDP(nn.fit)) | Visualization of the response-effect curve. | plot(allEffects(model)) |
| Accumulated local effect plots (ALE(nn.fit)) | Visualization of the response-effect curve; more robust against collinearity than PDPs. | plot(allEffects(model)) |
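
PDP() and ALE() can also be restricted to individual features; a short sketch, assuming both functions take a variable argument that selects the feature of interest:

# response-effect curve for a single feature
PDP(nn.fit, variable = "Petal.Length")
ALE(nn.fit, variable = "Petal.Length")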

summary() returns the feature importance, the ACEs, and the SDCEs:

# Calculate and return feature importance
summary(nn.fit)
#> Summary of Deep Neural Network Model
#> 
#> Feature Importance:
#>       variable importance_1
#> 1  Sepal.Width     1.133617
#> 2 Petal.Length     2.286806
#> 3  Petal.Width     2.783383
#> 4      Species     1.053672
#> 
#> Average Conditional Effects:
#>              Response_1
#> Sepal.Width  0.06250046
#> Petal.Length 0.35557117
#> Petal.Width  0.39397602
#> 
#> Standard Deviation of Conditional Effects:
#>              Response_1
#> Sepal.Width  0.08405767
#> Petal.Length 0.08007347
#> Petal.Width  0.06509943
#returns weights of neural network
coef(nn.fit)

Uncertainties/p-Values

We can use bootstrapping to obtain uncertainties for the xAI metrics (and also for the predictions). To do so, we have to retrain our model with bootstrapping enabled:

df = data
df[,2:4] = scale(df[,2:4]) # scaling can help the NN converge faster
nn.fit <- dnn(Sepal.Length~., data = df,
              epochs = 100,
              verbose = FALSE,
              loss= "mse",
              bootstrap = 30L
              )
summary(nn.fit)
#> Summary of Deep Neural Network Model
#> 
#> ── Feature Importance
#>                             Importance Std.Err Z value Pr(>|z|)   
#> Sepal.Width → Sepal.Length       1.084   0.353    3.07   0.0022 **
#> Petal.Length → Sepal.Length      6.778   2.649    2.56   0.0105 * 
#> Petal.Width → Sepal.Length       1.075   0.496    2.17   0.0301 * 
#> Species → Sepal.Length           0.209   0.119    1.76   0.0790 . 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> ── Average Conditional Effects
#>                                ACE Std.Err Z value Pr(>|z|)    
#> Sepal.Width → Sepal.Length  0.2761  0.0578    4.77  1.8e-06 ***
#> Petal.Length → Sepal.Length 0.7587  0.1142    6.64  3.1e-11 ***
#> Petal.Width → Sepal.Length  0.2448  0.0857    2.86   0.0043 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> ── Standard Deviation of Conditional Effects
#>                                ACE Std.Err Z value Pr(>|z|)    
#> Sepal.Width → Sepal.Length  0.1161  0.0260    4.46  8.1e-06 ***
#> Petal.Length → Sepal.Length 0.2134  0.0823    2.59  0.00953 ** 
#> Petal.Width → Sepal.Length  0.1021  0.0269    3.80  0.00015 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find that Petal.Length and Sepal.Width are significant (categorical features are not supported yet for average conditional effects).

Let’s compare the output to statistical outputs:

anova(lm(Sepal.Length~., data = df))
#> Analysis of Variance Table
#> 
#> Response: Sepal.Length
#>               Df  Sum Sq Mean Sq  F value    Pr(>F)    
#> Sepal.Width    1   2.060   2.060  15.0011 0.0001625 ***
#> Petal.Length   1 123.127 123.127 896.8059 < 2.2e-16 ***
#> Petal.Width    1   2.747   2.747  20.0055 1.556e-05 ***
#> Species        2   1.296   0.648   4.7212 0.0103288 *  
#> Residuals    144  19.770   0.137                       
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both the feature importance and the ANOVA report Petal.Length as the most important feature.

Visualization of the effects:

PDP(nn.fit)
(Figure: partial dependency plots)

library(effects)
#> Loading required package: carData
#> lattice theme set by effectsTheme()
#> See ?effectsTheme for details.
plot(allEffects(lm(Sepal.Length~., data = df)))
(Figure: effect plots of the linear model)

There are some differences between the statistical model and the NN, which is to be expected because the NN can fit the data more flexibly. At the same time, the differences come with large confidence intervals (e.g. the effect of Petal.Width).

Architecture

The architecture of a NN usually refers to the width and depth of the hidden layers (the layers between the input and the output layer) and their activation functions. You can increase the complexity of the NN by adding layers and/or making them wider:

# "simple NN" - low complexity
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 100,
              loss= "mse", validation = 0.2,
              hidden = c(5L), verbose=FALSE)
(Figure: training and validation loss)


# "large NN" - high complexity
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 100,
              loss= "mse", validation = 0.2,
              hidden = c(100L, 100), verbose=FALSE)
(Figure: training and validation loss)

There is no definitive guide to choosing the right architecture for the right task. However, there are some general rules/recommendations: In general, wider and deeper neural networks can improve generalization - but this is a double-edged sword because it also increases the risk of overfitting. So, if you increase the width and depth of the network, you should also add regularization (e.g., by increasing the lambda parameter, which corresponds to the regularization strength). Furthermore, in Pichler & Hartig, 2023, we investigated the effects of the hyperparameters on the prediction performance as a function of the data size. For example, we found that the selu activation function outperforms relu for small data sizes (<100 observations).

We recommend starting with moderate sizes (like the defaults), and if the model doesn’t generalize/converge, try larger networks along with a regularization that helps to minimize the risk of overfitting (see vignette("B-Training_neural_networks") ).
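
For example, a wider and deeper network could be combined with a mild elastic-net penalty (the lambda argument introduced below); a sketch rather than a tuned configuration:

# larger network with weak regularization to counteract overfitting
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 100,
              hidden = c(100L, 100L, 100L), lambda = 0.001,
              activation = "selu", loss = "mse",
              validation = 0.2, verbose = FALSE)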

Activation functions

By default, all layers are fitted with SELU as the activation function. Other activation functions, such as ReLU, are also available: \[ relu(x) = max(0, x) \] You can adjust the activation function of each layer individually to build exactly the network you want. In this case you have to provide a vector of the same length as the number of hidden layers. The activation function of the output layer is determined by the loss argument and does not have to be provided.

# relu as activation function for all layers:
nn.fit <- dnn(Sepal.Length~., data = data, hidden = c(10,10,10,10), activation= "relu")
#layer specific activation functions:
nn.fit <- dnn(Sepal.Length~., data = data,
              hidden = c(10,10,10,10), activation= c("relu","selu","tanh","sigmoid"))

Note: The default activation function should be adequate for most tasks. We don’t recommend tuning it.

Training hyperparameters

Regularization

Elastic net regularization

If elastic net is used, ‘cito’ will produce a sparse, regularized neural network. The L1/L2 penalty can be controlled with the arguments alpha and lambda.

\[ penalty = \lambda \cdot [ (1 - \alpha) \cdot |weights| + \alpha \cdot |weights|^2 ] \]

#elastic net penalty in all layers:
nn.fit <- dnn(Species~., data = data, alpha = 0.5, lambda = 0.01, verbose=FALSE, loss = "softmax")
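
Following the formula above, the two extremes of alpha correspond to a pure L1 or a pure L2 penalty; a brief sketch:

# alpha = 0: pure L1 penalty (encourages sparse weights)
nn.fit <- dnn(Species~., data = data, alpha = 0, lambda = 0.01, verbose = FALSE, loss = "softmax")
# alpha = 1: pure L2 penalty
nn.fit <- dnn(Species~., data = data, alpha = 1, lambda = 0.01, verbose = FALSE, loss = "softmax")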

Dropout Regularization

Dropout regularization as proposed in Srivastava et al. can be controlled similarly to elastic net regularization. In this approach, a random fraction of the nodes is temporarily left out (dropped) during each training iteration.

#dropout of 35% on all layers:
nn.fit <- dnn(Species~., data = data, loss = "softmax", dropout = 0.35, verbose=FALSE)
#dropout of 35% only on last 2 layers:
nn.fit <- dnn(Species~., data = data, loss = "softmax", dropout = c(0, 0, 0.35, 0.35), verbose=FALSE)

Learning rate

Learning rate scheduler

Learning rate schedulers allow you to start with a high learning rate and decrease it during the training process. This can lead to overall faster training. You can choose between different types of schedulers, namely lambda, multiplicative, one_cycle, and step.

The function config_lr_scheduler() helps you set up such a scheduler. See ?config_lr_scheduler for more information.

# Step learning rate scheduler that reduces the learning rate every 16 epochs by a factor of 0.5
scheduler <- config_lr_scheduler(type = "step",
                                 step_size = 16,
                                 gamma = 0.5)

nn.fit <- dnn(Sepal.Length~., data = data,lr = 0.01, lr_scheduler= scheduler, verbose = FALSE)
(Figure: training loss plot)

Optimizer

Optimizers are responsible for fitting the neural network, i.e. they try to minimize the loss function. By default, stochastic gradient descent is used. Custom optimizers can be configured with config_optimizer().
See ?config_optimizer for more information.


# adam optimizer with learning rate 0.002, betas of 0.95, 0.999 and eps of 1.5e-08
opt <- config_optimizer(
  type = "adam",
  betas = c(0.95, 0.999),
  eps = 1.5e-08)

nn.fit <- dnn(Species~., data = data,  optimizer = opt, lr=0.002, verbose=FALSE, loss = "softmax")
(Figure: training loss plot)

Early Stopping

Adding an early stopping criterion helps you save time by stopping the training process early if the validation loss of the current epoch is bigger than the validation loss n epochs earlier. The n is defined by the early_stopping argument. Early stopping requires validation > 0.

# Stops training if validation loss at current epoch is bigger than that 15 epochs earlier
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 1000,
              validation = 0.2, early_stopping = 15, verbose=FALSE)
(Figure: training and validation loss)

Continue training process

You can continue the training process of an existing model with continue_training().

# simple example, simply adding another 12 epochs to the training process
nn.fit <- continue_training(nn.fit, epochs = 12, verbose=FALSE)
head(predict(nn.fit))
#>            [,1]
#> [1,] -0.9836558
#> [2,] -1.4395623
#> [3,] -1.3351157
#> [4,] -1.3388002
#> [5,] -0.8847700
#> [6,] -0.4809422

It also allows you to change any training parameters, for example the learning rate. You can analyze the training process with analyze_training().


# pick the model with the smallest validation loss
# and continue training with changed parameters, in this case a smaller learning rate and a smaller batch size
nn.fit <- continue_training(nn.fit,
                            epochs = 32,
                            changed_params = list(lr = 0.001, batchsize = 16),
                            verbose = FALSE)
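
analyze_training() mentioned above takes the fitted model object; a minimal sketch, assuming it only needs the model:

# inspect the losses recorded over the training runs
analyze_training(nn.fit)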