Title: Distributional Random Forests
Version: 1.2.0
BugReports: https://github.com/lorismichel/drf/issues
Description: An implementation of distributional random forests as introduced in Cevid, Michel, Meinshausen & Buhlmann (2020) <doi:10.48550/arXiv.2005.14458>.
License: GPL-3
LinkingTo: Rcpp, RcppEigen
Depends: R (≥ 3.4.4)
Imports: stats, fastDummies, Matrix, methods, Rcpp (≥ 0.12.15), transport
RoxygenNote: 7.3.3
Suggests: DiagrammeR
SystemRequirements: GNU make
URL: https://github.com/lorismichel/drf
NeedsCompilation: yes
Packaged: 2025-12-08 20:13:25 UTC; rstudio
Author: Jeffrey Naf [cre], Loris Michel [aut], Domagoj Cevid [aut]
Maintainer: Jeffrey Naf <jeffrey.naf@unige.ch>
Repository: CRAN
Date/Publication: 2025-12-08 22:00:02 UTC
Writes the information for each node. If it is a leaf node: show it in a different color, show the number of samples, and show the leaf id. If it is a non-leaf node: show its splitting variable and splitting value.
Description
Writes the information for each node. If it is a leaf node: show it in a different color, show the number of samples, and show the leaf id. If it is a non-leaf node: show its splitting variable and splitting value.
Usage
create_dot_body(tree, index = 1)
Arguments
tree: the tree to convert
index: the index of the current node
Distributional Random Forests
Description
Trains a Distributional Random Forest which estimates the full conditional distribution P(Y | X)
for a possibly multivariate response Y and predictors X. The conditional distribution estimate is represented
as a weighted distribution of the training data. The weights can be conveniently used in downstream analysis
to estimate any quantity of interest τ(P(Y | X)).
Usage
drf(
X,
Y,
num.trees = 500,
splitting.rule = "FourierMMD",
num.features = 10,
bandwidth = NULL,
response.scaling = TRUE,
node.scaling = FALSE,
sample.weights = NULL,
sample.fraction = 0.5,
mtry = min(ceiling(sqrt(ncol(X)) + 20), ncol(X)),
min.node.size = 15,
honesty = TRUE,
honesty.fraction = 0.5,
honesty.prune.leaves = TRUE,
alpha = 0.05,
imbalance.penalty = 0,
compute.oob.predictions = TRUE,
num.threads = NULL,
seed = stats::runif(1, 0, .Machine$integer.max),
compute.variable.importance = FALSE
)
Arguments
X: The covariates used in the regression. Can be either a numeric matrix or a data.frame with numeric, factor, or character columns, where the last two will be one-hot-encoded.
Y: The (multivariate) outcome variable. Needs to be a matrix or a data frame consisting of numeric values.
num.trees: Number of trees grown in the forest. Default is 500.
splitting.rule: A character value. The type of splitting rule used: either "FourierMMD" (MMD splitting criterion with a FastMMD approximation for speed) or "CART" (sum of the standard CART criteria over the components of Y).
num.features: A numeric value; for "FourierMMD", the number of random features to sample. Default is 10.
bandwidth: A numeric value, the bandwidth of the Gaussian kernel used for "FourierMMD". Set to NULL by default, in which case the square root of the median heuristic is used.
response.scaling: A boolean value: should the responses be standardized before fitting the forest? Default is TRUE.
node.scaling: A boolean value: should the responses be standardized in every node of every tree? Default is FALSE.
sample.weights: (experimental) Weights given to each observation in estimation. If NULL, each observation receives the same weight. Default is NULL.
sample.fraction: Fraction of the data used to build each tree. Note: if honesty = TRUE, these subsamples are further cut by a factor of honesty.fraction. Default is 0.5.
mtry: Number of variables tried for each split. Default is min(ceiling(sqrt(ncol(X)) + 20), ncol(X)).
min.node.size: A target for the minimum number of observations in each tree leaf. Note that nodes with size smaller than min.node.size can occur, as in the original randomForest package. Default is 15.
honesty: Whether to use honest splitting (i.e., sub-sample splitting). Default is TRUE. For a detailed description of honesty, honesty.fraction, honesty.prune.leaves, and recommendations for parameter tuning, see the GRF reference (the original source).
honesty.fraction: The fraction of data used for determining splits if honesty = TRUE. Default is 0.5 (i.e., half of the data is used for determining splits and the other half for populating the nodes of the tree).
honesty.prune.leaves: If TRUE, prunes the estimation-sample tree such that no leaves are empty. If FALSE, keeps the same tree as determined on the splitting sample (if an empty leaf is encountered, that tree is skipped and does not contribute to the estimate). Setting this to FALSE may improve performance on small or marginally powered data, but requires more trees (note: tuning does not adjust the number of trees). Only applies if honesty is enabled. Default is TRUE.
alpha: A tuning parameter that controls the maximum imbalance of a split. Default is 0.05, meaning each child node must contain at least 5% of the observations in the parent node.
imbalance.penalty: A tuning parameter that controls how harshly imbalanced splits are penalized. Default is 0.
compute.oob.predictions: Whether OOB predictions on the training set should be precomputed. Default is TRUE.
num.threads: Number of threads used in training. By default, the number of threads is set to the maximum hardware concurrency.
seed: The seed of the C++ random number generator.
compute.variable.importance: A boolean value: should the variable importance be computed for the fitted object? Default is FALSE.
Value
A trained Distributional Random Forest object.
Examples
library(drf)
n = 10000
p = 20
d = 3
# Generate training data
X = matrix(rnorm(n * p), nrow=n)
Y = matrix(rnorm(n * d), nrow=n)
Y[, 1] = Y[, 1] + X[, 1]
Y[, 2] = Y[, 2] * X[, 2]
Y[, 3] = Y[, 3] * X[, 1] + X[, 2]
# Fit DRF object
drf.forest = drf(X, Y)
# Generate test data
X_test = matrix(rnorm(10 * p), nrow=10)
out = predict(drf.forest, newdata=X_test)
# Compute E[Y_1 | X] for all data in X_test directly from
# the weights representing the estimated distribution
out$weights %*% out$y[,1]
out = predict(drf.forest, newdata=X_test,
functional='mean')
# Compute E[Y_1 | X] for all data in X_test using built-in functionality
out[,1]
out = predict(drf.forest, newdata=X_test,
functional='quantile',
quantiles=c(0.25, 0.75),
transformation=function(y){y[1] * y[2] * y[3]})
# Compute 25% and 75% quantiles of Y_1*Y_2*Y_3, conditionally on X = X_test[1, ]
out[1,,]
out = predict(drf.forest, newdata=X_test,
functional='cov',
transformation=function(y){matrix(1:6, nrow=2) %*% y})
# Compute 2x2 covariance matrix for (1*Y_1 + 3*Y_2 + 5*Y_3, 2*Y_1 + 4*Y_2 + 6*Y_3),
# conditionally on X = X_test[1, ]
out[1,,]
out = predict(drf.forest, newdata=X_test,
functional='custom',
custom.functional=function(y, w){c(sum(y[, 1] * w), sum(y[, 2] * w))})
# Compute E[Y_1, Y_2 | X] for all data in X_test by providing custom functional that
# computes it from the weights
out
Export a tree in DOT format. This function generates a GraphViz representation of the tree, which is then written into 'dot_string'.
Description
Export a tree in DOT format. This function generates a GraphViz representation of the tree, which is then written into 'dot_string'.
Usage
export_graphviz(tree)
Arguments
tree: the tree to convert
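Examples
A minimal usage sketch (not from the original manual page; the simulated data and variable names are ours), extracting one tree from a small fitted forest and printing its DOT representation:
library(drf)
X <- matrix(rnorm(100 * 4), 100, 4)       # simulated covariates (illustrative)
Y <- matrix(rnorm(100), ncol = 1)
fit <- drf(X, Y, num.trees = 50)
dot <- export_graphviz(get_tree(fit, 1))  # DOT description of tree 1
cat(dot)                                  # render with any GraphViz tool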
Given a trained forest and test data, compute the training sample weights for each test point.
Description
During normal prediction, these weights are computed as an intermediate step towards producing estimates. This function allows the weights to be examined directly, so that they could potentially be used as the input to a different analysis.
Usage
get_sample_weights(forest, newdata = NULL, num.threads = NULL)
Arguments
forest: The trained forest.
newdata: Points at which predictions should be made. If NULL, makes out-of-bag predictions on the training set instead (i.e., provides predictions at Xi using only trees that did not use the i-th training example).
num.threads: Number of threads used in prediction. If set to NULL, the software automatically selects an appropriate amount.
Value
A sparse matrix where each row represents a test sample, and each column is a sample in the training data. The value at (i, j) gives the weight of training sample j for test sample i.
Examples
## Not run:
p <- 10
n <- 100
X <- matrix(2 * runif(n * p) - 1, n, p)
Y <- (X[, 1] > 0) + 2 * rnorm(n)
rrf <- drf(X, matrix(Y,ncol=1), mtry = p)
sample.weights.oob <- get_sample_weights(rrf)
n.test <- 15
X.test <- matrix(2 * runif(n.test * p) - 1, n.test, p)
sample.weights <- get_sample_weights(rrf, X.test)
## End(Not run)
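Continuing the example above (a sketch, not part of the original manual page), the returned weight matrix can be used directly; for instance, a conditional mean estimate at each test point is the weighted average of the training responses:
# One conditional mean estimate per test point, computed directly
# from the weights (one row of sample.weights per test sample).
cond.mean <- as.numeric(sample.weights %*% Y)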
Retrieve a single tree from a trained forest object.
Description
Retrieve a single tree from a trained forest object.
Usage
get_tree(forest, index)
Arguments
forest: The trained forest.
index: The index of the tree to retrieve.
Value
A DRF tree object containing the following attributes:
drawn_samples: a list of examples that were used in training the tree. This includes examples that were used in choosing splits, as well as the examples that populate the leaf nodes. Put another way, if honesty is enabled, this list includes both subsamples from the split (J1 and J2 in the notation of the paper).
num_samples: the number of examples used in training the tree.
nodes: a list of objects representing the nodes in the tree, starting with the root node. Each node contains an 'is_leaf' attribute, which indicates whether it is an interior or leaf node. Interior nodes contain the attributes 'left_child' and 'right_child', which give the indices of their children in the list, as well as 'split_variable' and 'split_value', which describe the split that was chosen. Leaf nodes only have the attribute 'samples', a list of the training examples that the leaf contains. Note that if honesty is enabled, this list will only contain examples from the second subsample that was used to 'repopulate' the tree (J2 in the notation of the paper).
Examples
## Not run:
# Train a DRF forest.
n <- 50
p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] * rnorm(n)
d.forest <- drf(X, matrix(Y, ncol = 1))
# Examine a particular tree.
d.tree <- get_tree(d.forest, 3)
d.tree$nodes
## End(Not run)
A default leaf_stats for forest classes without a leaf_stats method; it always returns NULL.
Description
A default leaf_stats for forest classes without a leaf_stats method; it always returns NULL.
Usage
## Default S3 method:
leaf_stats(forest, samples, ...)
Arguments
forest: Any forest.
samples: The samples to include in the calculations.
...: Additional arguments (currently ignored).
Calculate summary stats given a set of samples for regression forests.
Description
Calculate summary stats given a set of samples for regression forests.
Usage
## S3 method for class 'drf'
leaf_stats(forest, samples, ...)
Arguments
forest: The trained DRF forest.
samples: The samples to include in the calculations.
...: Additional arguments (currently ignored).
Value
A named vector containing summary stats.
Compute the median heuristic for the MMD bandwidth choice
Description
Compute the median heuristic for the MMD bandwidth choice
Usage
medianHeuristic(Y)
Arguments
Y: the response matrix
Value
the median heuristic
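For example (a minimal sketch on simulated responses, not from the original manual page), the heuristic can be computed on the training responses and then passed as the bandwidth argument of drf():
library(drf)
Y <- matrix(rnorm(200 * 2), ncol = 2)  # simulated responses (illustrative)
b <- medianHeuristic(Y)                # bandwidth suggested by the heuristic
# fit <- drf(X, Y, bandwidth = b)      # pass as the MMD bandwidth of drf()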
Plot a DRF tree object.
Description
Plot a DRF tree object.
Usage
## S3 method for class 'drf_tree'
plot(x, ...)
Arguments
x: The tree to plot.
...: Additional arguments (currently ignored).
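Examples
A minimal usage sketch (simulated data; rendering relies on the suggested DiagrammeR package):
library(drf)
X <- matrix(rnorm(100 * 4), 100, 4)
Y <- matrix(rnorm(100), ncol = 1)
fit <- drf(X, Y, num.trees = 50)
plot(get_tree(fit, 1))  # displays the tree; needs DiagrammeR installed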
Predict from Distributional Random Forests object
Description
Obtain predictions from a DRF forest object. For any point x in the predictor space, it returns the estimate of the conditional distribution
P(Y | X = x), represented as the weighted empirical distribution ∑_i w_i δ_{y_i} of the training observations {y_i}.
Additionally, this function directly supports estimating certain target quantities τ(P(Y | X)),
such as conditional quantiles, variances, or correlations.
Usage
## S3 method for class 'drf'
predict(
object,
newdata = NULL,
functional = NULL,
transformation = NULL,
custom.functional = NULL,
num.threads = NULL,
...
)
Arguments
object: Trained DRF forest object.
newdata: Points at which predictions should be made. If NULL, returns out-of-bag predictions on the training set (i.e., for each training point, the prediction is computed using only the trees that did not use that point for training).
functional: Optional. A string indicating the statistical functional to compute from the weights. Supported values include "mean", "quantile", "cov", and "custom", as used in the examples below.
transformation: An optional transformation function applied to the responses before computing the target functional. It helps to extend the functionality to a much wider range of targets. By default the responses are not transformed, i.e. the identity function is used.
custom.functional: A user-defined function of the training responses y and the weights w, used when functional = "custom" (see the examples below).
num.threads: Number of threads used for computing. If set to NULL, the software automatically selects an appropriate amount.
...: Additional parameters passed to the chosen functional (e.g., quantiles for functional = "quantile").
Value
If functional is NULL, returns a list containing the matrix of training responses as well as the matrix of weights, whose number of rows corresponds to the number of rows of newdata and whose number of columns corresponds to the number of training data points. If functional is specified, the desired quantity is returned in the format described above.
Examples
library(drf)
n = 10000
p = 20
d = 3
# Generate training data
X = matrix(rnorm(n * p), nrow=n)
Y = matrix(rnorm(n * d), nrow=n)
Y[, 1] = Y[, 1] + X[, 1]
Y[, 2] = Y[, 2] * X[, 2]
Y[, 3] = Y[, 3] * X[, 1] + X[, 2]
# Fit DRF object
drf.forest = drf(X, Y)
# Generate test data
X_test = matrix(rnorm(10 * p), nrow=10)
out = predict(drf.forest, newdata=X_test)
# Compute E[Y_1 | X] for all data in X_test directly from
# the weights representing the estimated distribution
out$weights %*% out$y[,1]
out = predict(drf.forest, newdata=X_test,
functional='mean')
# Compute E[Y_1 | X] for all data in X_test using built-in functionality
out[,1]
out = predict(drf.forest, newdata=X_test,
functional='quantile',
quantiles=c(0.25, 0.75),
transformation=function(y){y[1] * y[2] * y[3]})
# Compute 25% and 75% quantiles of Y_1*Y_2*Y_3, conditionally on X = X_test[1, ]
out[1,,]
out = predict(drf.forest, newdata=X_test,
functional='cov',
transformation=function(y){matrix(1:6, nrow=2) %*% y})
# Compute 2x2 covariance matrix for (1*Y_1 + 3*Y_2 + 5*Y_3, 2*Y_1 + 4*Y_2 + 6*Y_3),
# conditionally on X = X_test[1, ]
out[1,,]
out = predict(drf.forest, newdata=X_test,
functional='custom',
custom.functional=function(y, w){c(sum(y[, 1] * w), sum(y[, 2] * w))})
# Compute E[Y_1, Y_2 | X] for all data in X_test by providing custom functional that
# computes it from the weights
out
Print a DRF forest object.
Description
Print a DRF forest object.
Usage
## S3 method for class 'drf'
print(x, decay.exponent = 2, max.depth = 4, ...)
Arguments
x: The trained forest to print.
decay.exponent: A tuning parameter that controls the importance of split depth.
max.depth: The maximum depth of splits to consider.
...: Additional arguments (currently ignored).
Print a DRF tree object.
Description
Print a DRF tree object.
Usage
## S3 method for class 'drf_tree'
print(x, ...)
Arguments
x: The tree to print.
...: Additional arguments (currently ignored).
Calculate which features the forest split on at each depth.
Description
Calculate which features the forest split on at each depth.
Usage
split_frequencies(forest, max.depth = 4)
Arguments
forest: The trained forest.
max.depth: Maximum depth of splits to consider.
Value
A matrix of split depth by feature index, where each value is the number of times the feature was split on at that depth.
Examples
## Not run:
# Train a DRF forest.
n <- 50
p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] * rnorm(n)
d.forest <- drf(X, matrix(Y, ncol = 1))
# Calculate the split frequencies for this forest.
split_frequencies(d.forest)
## End(Not run)
Variable importance based on MMD
Description
Computes an MMD-based variable importance for the DRF fit.
Usage
variableImportance(
object,
h = NULL,
response.scaling = TRUE,
type = "difference"
)
Arguments
object: an S3 object of class drf.
h: the bandwidth parameter; defaults to NULL, in which case the median heuristic is used.
response.scaling: a boolean value indicating whether the responses should be scaled globally beforehand.
type: the type of importance: either "raw" (the plain MMD values), "relative" (the ratio to the observed MMD), or "difference" (the excess over the observed MMD).
Value
A vector of variable importance values.
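Examples
A minimal usage sketch on simulated data (not from the original manual page; variable names are ours):
library(drf)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(X[, 1] + rnorm(n), ncol = 1)  # only the first predictor matters
fit <- drf(X, Y, num.trees = 100)
variableImportance(fit)                   # "difference" importance by default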
Calculate a simple measure of 'importance' for each feature.
Description
A simple weighted sum of how many times feature i was split on at each depth in the forest.
Usage
variable_importance(forest, decay.exponent = 2, max.depth = 4)
Arguments
forest: The trained forest.
decay.exponent: A tuning parameter that controls the importance of split depth.
max.depth: Maximum depth of splits to consider.
Value
A list specifying an 'importance value' for each feature.
Examples
## Not run:
# Train a DRF forest.
n <- 50
p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] * rnorm(n)
d.forest <- drf(X, matrix(Y, ncol = 1))
# Calculate the 'importance' of each feature.
variable_importance(d.forest)
## End(Not run)
Weighted quantiles
Description
Weighted quantiles
Usage
weighted.quantile(x, w, probs = seq(0, 1, 0.25), na.rm = TRUE)
Arguments
x: a vector of observations
w: a vector of weights
probs: the probabilities for which quantiles are computed
na.rm: should missing values be removed? Default is TRUE.
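Examples
For illustration (a minimal sketch, not from the original manual page), with uniform weights the weighted quantiles should closely match the ordinary sample quantiles:
library(drf)
x <- rnorm(100)
w <- rep(1 / 100, 100)                   # uniform weights
weighted.quantile(x, w, probs = c(0.25, 0.5, 0.75))
quantile(x, probs = c(0.25, 0.5, 0.75))  # comparable unweighted quantiles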