Abstract
‘cito’ allows you to build and train neural networks using the R formula syntax. It relies on the ‘torch’ package for numerical computations and optional graphics card support. Before using ‘cito’, make sure that the current version of ‘torch’ is installed and running.
if(!require(torch)) install.packages("torch")
#> Loading required package: torch
library(torch)
if(!torch_is_installed()) install_torch()
library(cito)
If you have problems installing Torch, check out the installation help from the torch developer.
‘cito’ can handle many different response types. Both common loss functions from machine learning and likelihoods for statistical models are supported:
Name | Explanation | Example / Task | Meta-Code |
---|---|---|---|
mse | mean squared error | Regression, predicting continuous values | dnn(Sepal.Length~., ..., loss = "mse") |
mae | mean absolute error | Regression, predicting continuous values | dnn(Sepal.Length~., ..., loss = "mae") |
softmax | categorical cross-entropy | Multi-class, species classification | dnn(Species~., ..., loss = "softmax") |
cross-entropy | categorical cross-entropy | Multi-class, species classification | dnn(Species~., ..., loss = "cross-entropy") |
gaussian | Normal likelihood | Regression, residual error is also estimated (similar to stats::lm()) | dnn(Sepal.Length~., ..., loss = "gaussian") |
binomial | Binomial likelihood | Classification/logistic regression, mortality (0/1 data) | dnn(Presence~., ..., loss = "binomial") |
poisson | Poisson likelihood | Regression, count data, e.g. species abundances | dnn(Abundance~., ..., loss = "poisson") |
Moreover, all non-multilabel losses (i.e. all except softmax and cross-entropy) can be used for multilabel responses via the cbind syntax:
dnn(cbind(Sepal.Length, Sepal.Width)~., ..., loss = "mse")
The likelihoods (Gaussian, Binomial, and Poisson) can also be passed as their stats equivalents:
dnn(Sepal.Length~., ..., loss = stats::gaussian)
In this vignette, we will work with the iris dataset and build a regression model.
data <- datasets::iris
head(data)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#scale dataset
data <- data.frame(scale(data[,-5]),Species = data[,5])
In ‘cito’, neural networks are specified and fitted with the dnn() function. Models can also be trained on the GPU by setting device = "cuda" (but only if you have installed the CUDA dependencies). This is suggested if you are working with large data sets or networks.
library(cito)
#fitting a regression model to predict Sepal.Length
nn.fit <- dnn(Sepal.Length~. , data = data, epochs = 12, loss = "mse", verbose=FALSE)
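For large data sets or networks, the same model can be fitted on the GPU as described above. A minimal sketch, assuming a CUDA-capable GPU and a ‘torch’ build with CUDA support:
# fit the same regression model on the GPU (requires the CUDA dependencies)
nn.fit_gpu <- dnn(Sepal.Length~. , data = data, epochs = 12, loss = "mse",
                  device = "cuda", verbose = FALSE)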
You can plot the network structure to get visual feedback on the created object. Be aware that this may take some time for large networks.
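A minimal sketch, assuming the generic plot() method of the fitted model is used to draw the architecture:
# visualize the architecture of the fitted network
plot(nn.fit)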
[Figure: visualization of the network structure]
The neural network has 5 input nodes (3 continuous features, Sepal.Width, Petal.Length, and Petal.Width, plus the contrasts of the Species variable (n_classes - 1)) and 1 output node for the response (Sepal.Length).
At the start of the training we calculate a baseline loss for an intercept-only model. This allows us to monitor the training, because the goal is to beat the baseline loss. If we don't, we need to adjust the optimization parameters (epochs and lr (learning rate)):
nn.fit <- dnn(Sepal.Length~. , data = data, epochs = 50, lr = 0.6, loss = "mse", verbose = FALSE) # lr too high
#> Error in eval_bare(loop, env): Loss is NA. Bad training, please check learning rate or regularization strength. See vignette('02_Troubleshooting') for help.
[Figure: training loss with a too high learning rate vs. a successful training run]
See vignette("B-Training_neural_networks"
) for more
details on how to adjust the optimization procedure and increase the
probability of convergence.
To see where your model might suffer from overfitting, adding a validation set can be useful. With dnn() you can set validation = 0.x to define the fraction of the data that will not be used for training, only for validation after each epoch. During training, a loss plot will show you how the two losses behave (see vignette("B-Training_neural_networks") for details on training NNs and guaranteeing their convergence).
#20% of data set is used as validation set
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 32,
loss= "mse", validation = 0.2)
The weights of the last epoch and of the epoch with the lowest validation loss are saved.
The default is to use the weights of the last epoch, but we can also tell the model to use the weights with the lowest validation loss:
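A minimal sketch of how this could look; the fields use_model_epoch and losses$valid_l are assumptions about the internals of the fitted object, so check ?dnn for the actual interface:
# assumption: the fitted object exposes per-epoch losses and an epoch selector
nn.fit$use_model_epoch <- which.min(nn.fit$losses$valid_l)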
xAI can produce outputs that are similar to known outputs from statistical models:
xAI Method | Explanation | Statistical equivalent |
---|---|---|
Feature importance (returned by summary(nn.fit)) | Feature importance based on permutations (see Fisher, Rudin, and Dominici (2018)); similar to how much variance in the data is explained by the features. | anova(model) |
Average conditional effects (ACE) (returned by summary(nn.fit)) | Average of local derivatives, an approximation of linear effects (see Pichler and Hartig (2023)). | lm(Y~X) |
Standard Deviation of Conditional Effects (SDCE) (returned by summary(nn.fit)) | Standard deviation of the average conditional effects; correlates with the non-linearity of the effects. | |
Partial dependency plots (PDP(nn.fit)) | Visualization of the response-effect curve. | plot(allEffects(model)) |
Accumulated local effect plots (ALE(nn.fit)) | Visualization of the response-effect curve; more robust against collinearity than PDPs. | plot(allEffects(model)) |
The summary() returns feature importance, ACE, and SDCE:
# Calculate and return feature importance
summary(nn.fit)
#> Summary of Deep Neural Network Model
#>
#> Feature Importance:
#> variable importance_1
#> 1 Sepal.Width 1.133617
#> 2 Petal.Length 2.286806
#> 3 Petal.Width 2.783383
#> 4 Species 1.053672
#>
#> Average Conditional Effects:
#> Response_1
#> Sepal.Width 0.06250046
#> Petal.Length 0.35557117
#> Petal.Width 0.39397602
#>
#> Standard Deviation of Conditional Effects:
#> Response_1
#> Sepal.Width 0.08405767
#> Petal.Length 0.08007347
#> Petal.Width 0.06509943
We can use bootstrapping to obtain uncertainties for the xAI metrics (and also for the predictions). For that we have to retrain our model with bootstrapping enabled:
df = data
df[,2:4] = scale(df[,2:4]) # scaling can help the NN to converge faster
nn.fit <- dnn(Sepal.Length~., data = df,
epochs = 100,
verbose = FALSE,
loss= "mse",
bootstrap = 30L
)
summary(nn.fit)
#> Summary of Deep Neural Network Model
#>
#> ── Feature Importance
#> Importance Std.Err Z value Pr(>|z|)
#> Sepal.Width → Sepal.Length 1.084 0.353 3.07 0.0022 **
#> Petal.Length → Sepal.Length 6.778 2.649 2.56 0.0105 *
#> Petal.Width → Sepal.Length 1.075 0.496 2.17 0.0301 *
#> Species → Sepal.Length 0.209 0.119 1.76 0.0790 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ── Average Conditional Effects
#> ACE Std.Err Z value Pr(>|z|)
#> Sepal.Width → Sepal.Length 0.2761 0.0578 4.77 1.8e-06 ***
#> Petal.Length → Sepal.Length 0.7587 0.1142 6.64 3.1e-11 ***
#> Petal.Width → Sepal.Length 0.2448 0.0857 2.86 0.0043 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> ── Standard Deviation of Conditional Effects
#> ACE Std.Err Z value Pr(>|z|)
#> Sepal.Width → Sepal.Length 0.1161 0.0260 4.46 8.1e-06 ***
#> Petal.Length → Sepal.Length 0.2134 0.0823 2.59 0.00953 **
#> Petal.Width → Sepal.Length 0.1021 0.0269 3.80 0.00015 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find that Sepal.Width, Petal.Length, and Petal.Width are significant, while Species is not (categorical features are not yet supported for average conditional effects).
Let’s compare the output to statistical outputs:
anova(lm(Sepal.Length~., data = df))
#> Analysis of Variance Table
#>
#> Response: Sepal.Length
#> Df Sum Sq Mean Sq F value Pr(>F)
#> Sepal.Width 1 2.060 2.060 15.0011 0.0001625 ***
#> Petal.Length 1 123.127 123.127 896.8059 < 2.2e-16 ***
#> Petal.Width 1 2.747 2.747 20.0055 1.556e-05 ***
#> Species 2 1.296 0.648 4.7212 0.0103288 *
#> Residuals 144 19.770 0.137
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both the feature importance and the ANOVA identify Petal.Length as the most important feature.
Visualization of the effects:
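These curves can be produced with the helpers listed in the table above, for example:
# response-effect curves of the fitted NN; ALE is more robust against collinearity
PDP(nn.fit)
ALE(nn.fit)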
[Figure: effect curves of the fitted neural network]
library(effects)
#> Loading required package: carData
#> lattice theme set by effectsTheme()
#> See ?effectsTheme for details.
plot(allEffects(lm(Sepal.Length~., data = df)))
[Figure: effect plots of the linear model]
There are some differences between the statistical model and the NN, which is to be expected because the NN can fit the data more flexibly. At the same time, the differences come with large confidence intervals (e.g., the effect of Petal.Width).
In a NN, the architecture usually refers to the width and depth of the hidden layers (the layers between the input and the output layer) and their activation functions. You can increase the complexity of the NN by adding layers and/or making them wider:
# "simple NN" - low complexity
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 100,
loss= "mse", validation = 0.2,
hidden = c(5L), verbose=FALSE)
[Figure: training loss of the small network]
# "large NN" - high complexity
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 100,
loss= "mse", validation = 0.2,
hidden = c(100L, 100), verbose=FALSE)
[Figure: training loss of the large network]
There is no definitive guide to choosing the right architecture for the right task. However, there are some general rules/recommendations: in general, wider and deeper neural networks can improve generalization - but this is a double-edged sword because it also increases the risk of overfitting. So, if you increase the width and depth of the network, you should also add regularization (e.g., by increasing the lambda parameter, which corresponds to the regularization strength). Furthermore, in Pichler & Hartig, 2023, we investigated the effects of the hyperparameters on the prediction performance as a function of the data size. For example, we found that the selu activation function outperforms relu for small data sizes (<100 observations).
We recommend starting with moderate sizes (like the defaults), and if the model doesn't generalize/converge, try larger networks along with a regularization that helps to minimize the risk of overfitting (see vignette("B-Training_neural_networks")), as sketched below.
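For example, a wider and deeper network combined with some regularization could look like this (a sketch; the hidden sizes and penalty strength are arbitrary, and lambda is the elastic-net argument described below):
# larger network with a mild penalty to counter overfitting
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 100,
              loss = "mse", validation = 0.2,
              hidden = c(50L, 50L, 50L), lambda = 0.01, verbose = FALSE)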
By default, all layers are fitted with SELU as the activation function. Other activation functions are available as well, e.g. ReLU, \[ relu(x) = max(0, x) \] You can also adjust the activation function of each layer individually to build exactly the network you want. In this case you have to provide a vector of the same length as the number of hidden layers. The activation function of the output layer is determined by the loss argument and does not have to be provided.
# relu as activation function for all layers:
nn.fit <- dnn(Sepal.Length~., data = data, hidden = c(10,10,10,10), activation = "relu")
#layer specific activation functions:
nn.fit <- dnn(Sepal.Length~., data = data,
hidden = c(10,10,10,10), activation= c("relu","selu","tanh","sigmoid"))
Note: The default activation function should be adequate for most tasks. We don’t recommend tuning it.
If elastic-net regularization is used, ‘cito’ produces a sparse, regularized neural network. The strength of the L1/L2 penalty can be controlled with the arguments alpha and lambda.
\[ loss = \lambda * [ (1 - \alpha) * |weights| + \alpha |weights|^2 ] \]
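A minimal example (the penalty values are arbitrary; lambda scales the overall penalty, alpha mixes the L1 and L2 terms):
# elastic-net regularized network
nn.fit <- dnn(Sepal.Length~., data = data, loss = "mse",
              lambda = 0.01, alpha = 0.5, verbose = FALSE)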
Dropout regularization as proposed in Srivastava et al. can be controlled similarly to elastic-net regularization. In this approach, a randomly chosen fraction of nodes is dropped (set to zero) in each training iteration.
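A minimal sketch (the dropout rate is arbitrary):
# randomly drop 20% of the hidden nodes in each training step
nn.fit <- dnn(Sepal.Length~., data = data, loss = "mse",
              dropout = 0.2, verbose = FALSE)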
Learning rate schedulers allow you to start with a high learning rate and decrease it during the training process, which leads to overall faster training. You can choose between different types of schedulers, namely lambda, multiplicative, one_cycle, and step.
The function config_lr_scheduler() helps you set up such a scheduler. See ?config_lr_scheduler for more information.
# Step Learning rate scheduler that reduces learning rate every 16 steps by a factor of 0.5
scheduler <- config_lr_scheduler(type = "step",
step_size = 16,
gamma = 0.5)
nn.fit <- dnn(Sepal.Length~., data = data,lr = 0.01, lr_scheduler= scheduler, verbose = FALSE)
[Figure: training loss with the step learning rate scheduler]
Optimizers are responsible for fitting the neural network; the optimizer tries to minimize the loss function. By default, stochastic gradient descent is used. Custom optimizers can be configured with config_optimizer(). See ?config_optimizer for more information.
# adam optimizer with learning rate 0.002, betas set to 0.95, 0.999 and eps to 1.5e-08
opt <- config_optimizer(
  type = "adam",
  betas = c(0.95, 0.999),
  eps = 1.5e-08)
nn.fit <- dnn(Species~., data = data, optimizer = opt, lr=0.002, verbose=FALSE, loss = "softmax")
[Figure: training loss with the custom optimizer]
Adding an early stopping criterion helps you save time by stopping the training process early if the validation loss at the current epoch is bigger than the validation loss n epochs earlier. The n is defined by the early_stopping argument. Early stopping requires setting validation > 0.
# Stops training if validation loss at current epoch is bigger than that 15 epochs earlier
nn.fit <- dnn(Sepal.Length~., data = data, epochs = 1000,
validation = 0.2, early_stopping = 15, verbose=FALSE)
[Figure: training loss with early stopping]
You can continue the training process of an existing model with continue_training().
# simple example, simply adding another 12 epochs to the training process
nn.fit <- continue_training(nn.fit, epochs = 12, verbose=FALSE)
head(predict(nn.fit))
#> [,1]
#> [1,] -0.9836558
#> [2,] -1.4395623
#> [3,] -1.3351157
#> [4,] -1.3388002
#> [5,] -0.8847700
#> [6,] -0.4809422
It also allows you to change any training parameters, for example the learning rate. You can analyze the training process with analyze_training().
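A minimal sketch, assuming analyze_training() takes the fitted model as its only required argument (see ?continue_training for how to change parameters such as the learning rate):
# continue for a few more epochs, then inspect the full training history
nn.fit <- continue_training(nn.fit, epochs = 32, verbose = FALSE)
analyze_training(nn.fit)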