The glmnetUtils
package provides a collection of tools to streamline the process of
fitting elastic net models with glmnet. I wrote the
package after a couple of projects where I found myself writing the same
boilerplate code to convert a data frame into a predictor matrix and a
response vector. In addition to providing a formula interface, it also
features a function cva.glmnet
to do crossvalidation for
both \(\alpha\) and \(\lambda\), as well as some utility
functions.
The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here are some simple examples for different types of data:
## Call:
## glmnetUtils:::glmnet.formula(formula = mpg ~ cyl + disp + hp,
## data = mtcars)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03326 0.11690 0.41003 1.02839 1.44125 5.05505
# multinomial logistic regression with specified elastic net alpha parameter
(irisMod <- glmnet(Species ~ ., data=iris, family="multinomial", alpha=0.5))
## Call:
## glmnetUtils:::glmnet.formula(formula = Species ~ ., data = iris,
## family = "multinomial", alpha = 0.5)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 0.5
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000870 0.0008707 0.0087093 0.0979220 0.0870709 0.8699915
# Poisson regression with an offset
(InsMod <- glmnet(Claims ~ District + Group + Age, data=MASS::Insurance,
family="poisson", offset=log(Holders)))
## Call:
## glmnetUtils:::glmnet.formula(formula = Claims ~ District + Group +
## Age, data = MASS::Insurance, family = "poisson", offset = log(Holders))
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02877 0.11613 0.46883 1.40515 1.89269 7.64083
Under the hood, glmnetUtils creates a model matrix and response
vector, and passes them to the glmnet package to do the actual model
fitting. A simple print
method is also provided, to show
the main model details at a glance. I’ll describe shortly what the
“sparse model matrix” and “use model.frame” options do.
Predicting from a model works as you’d expect: just pass a data frame
containing the new observations to the predict
method. You
can also specify any arguments that predict.glmnet
accepts.
# least squares regression: get predictions for lambda=1
predict(mtcarsMod, newdata=mtcars, s=1)
# multinomial logistic regression: get predicted class
predict(irisMod, newdata=iris, type="class")
# Poisson regression: need to specify offset
predict(InsMod, newdata=MASS::Insurance, offset=log(Holders))
If you want, you can still use the original model matrix-plus-response syntax:
mtcarsX <- as.matrix(mtcars[c("cyl", "disp", "hp")])
mtcarsY <- mtcars$mpg
mtcarsMod2 <- glmnet(mtcarsX, mtcarsY)
summary(as.numeric(predict(mtcarsMod, mtcars) -
predict(mtcarsMod2, mtcarsX)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
This shows that the resulting models are identical, in terms of the predictions they make and the regularisation parameters used.
There are two ways in which glmnetUtils can generate a model matrix
out of a formula and data frame. The first is to use the standard R
machinery comprising model.frame
and
model.matrix
; and the second is to build the matrix one
variable at a time.
model.frame
This is the simpler option, and the one that is most compatible with
other R modelling functions. The model.frame
function takes
a formula and data frame and returns a model frame: a data
frame with special information attached that lets R make sense of the
terms in the formula. For example, if a formula includes an interaction
term, the model frame will specify which columns in the data relate to
the interaction, and how they should be treated. Similarly, if the
formula includes expressions like exp(x)
or
I(x^2)
on the RHS, model.frame
will evaluate
these expressions and include them in the output.
The major disadvantage of using model.frame
is that it
generates a terms
object, which encodes how variables and
interactions are organised. One of the attributes of this object is a
matrix with one row per variable, and one column per main effect and
interaction. At minimum, this is (approximately) a \(p \times p\) square matrix where \(p\) is the number of main effects in the
model. For wide datasets with \(p >
10000\), this matrix can approach or exceed a gigabyte in size.
Even if there is enough memory to store such an object, generating the
model matrix can be very slow.
Another issue with the standard R approach is the treatment of
factors. Normally, model.matrix will turn an \(N\)-level factor into an indicator matrix
with \(N-1\) columns, with one column
being dropped. This is necessary for unregularised models as fit with
lm
and glm
, since the full set of \(N\) columns is linearly dependent. With the
usual treatment
contrasts, the interpretation is that the dropped column represents
a baseline level, while the coefficients for the other columns represent
the difference in the response relative to the baseline.
This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
To deal with the problems above, glmnetUtils by default will avoid
using model.frame
, instead building up the model matrix
term-by-term. This avoids the memory cost of creating a
terms
object, and can be noticeably faster than the
standard approach. It will also include one column in the model matrix
for all levels in a factor; that is, no baseline level is
assumed. In this situation, the coefficients represent differences from
the overall mean response, and shrinking them to zero is
meaningful (usually).
This works in an additive fashion, ie the formula
~ a + b:c + d*e
is treated as consisting of three terms,
a
, b:c
and d*e
each of which is
processed independently of the others. A dot in the formula includes all
main effect terms, ie ~ . + a:b + f(x)
expands to
~ a + b + x + a:b + f(x)
(assuming a, b and x are the only
columns in the data). Note that a formula like
~ (a + b) + (c + d)
will be treated as two terms,
a + b
and c + d
.
To examine the speed impact of using model.frame
, let’s
do some simple comparisons of run times. We’ll generate sample datasets
with 100, 1,000 and 10,000 predictors, and then run glmnet
with both options for generating the model matrix.
# generate sample (uncorrelated) data of a given size
makeSampleData <- function(N, P)
{
X <- matrix(rnorm(N*P), nrow=N)
data.frame(y=rnorm(N), X)
}
# test for three sizes: 100/1000/10000 predictors
df1 <- makeSampleData(N=1000, P=100)
df2 <- makeSampleData(N=1000, P=1000)
df3 <- makeSampleData(N=1000, P=10000)
library(microbenchmark)
res <- microbenchmark(
glmnet(y ~ ., df1, use.model.frame=TRUE),
glmnet(y ~ ., df1, use.model.frame=FALSE),
glmnet(y ~ ., df2, use.model.frame=TRUE),
glmnet(y ~ ., df2, use.model.frame=FALSE),
glmnet(y ~ ., df3, use.model.frame=TRUE),
glmnet(y ~ ., df3, use.model.frame=FALSE),
times=10
)
print(res, unit="s", digits=2)
## Unit: seconds
## expr min lq mean median uq max neval
## glmnet(y ~ ., df1, use.model.frame = TRUE) 0.024 0.024 0.027 0.025 0.029 0.032 10
## glmnet(y ~ ., df1, use.model.frame = FALSE) 0.023 0.026 0.029 0.026 0.028 0.051 10
## glmnet(y ~ ., df2, use.model.frame = TRUE) 3.703 3.916 4.258 4.272 4.428 5.153 10
## glmnet(y ~ ., df2, use.model.frame = FALSE) 3.756 3.874 4.291 4.352 4.561 5.073 10
## glmnet(y ~ ., df3, use.model.frame = TRUE) 11.973 12.353 13.262 13.350 13.864 14.622 10
## glmnet(y ~ ., df3, use.model.frame = FALSE) 4.295 4.639 4.992 4.822 5.060 6.111 10
From this, we can see that for datasets with up to 1,000 predictors,
both methods are about as fast as each other. However, for 10,000
predictors (not uncommon these days), the model.frame
method takes three times as long as building the model matrix term by
term.
What happens if we take it up to 100,000 predictors? As seen below,
the standard approach of using model.frame
fails when R
runs out of memory: the data frame itself is about 800MB in size, but
trying to allocate the terms
object requires more then
67GB. However, building the model matrix by term still works, and (on
this machine) finishes in about two minutes.
## Warning in terms.formula(formula, data = data): Reached total allocation of
## 32666Mb: see help(memory.size)
## Error: cannot allocate vector of size 37.3 Gb
## Call:
## glmnet.formula(formula = y ~ ., data = df4, use.model.frame = FALSE)
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha: 1
## Lambda summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002393 0.006506 0.017680 0.032470 0.048080 0.130700
As an option, glmnetUtils can also generate a sparse model
matrix, using the sparse.model.matrix
function provided in
the Matrix package. This works exactly the same as a regular model
matrix, but takes up significantly less memory if many of its entries
are zero. A scenario where this is the case would be where many of the
predictors are factors, each with a large number of levels. This can be
combined with both of the previously mentioned options for generating
model matrices.
One piece missing from the standard glmnet package is a way of
choosing \(\alpha\), the elastic net
mixing parameter, similar to how cv.glmnet
chooses \(\lambda\), the shrinkage parameter. To fix
this, glmnetUtils provides the cva.glmnet
function, which
uses crossvalidation to examine the impact on the model of changing
\(\alpha\) and \(\lambda\). The interface is the same as for
the other functions:
# Leukemia dataset from Trevor Hastie's website:
# https://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData
leuk <- do.call(data.frame, Leukemia)
leukMod <- cva.glmnet(y ~ ., data=leuk, family="binomial")
leukMod
## Call:
## cva.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial")
##
## Model fitting options:
## Sparse model matrix: FALSE
## Use model.frame: FALSE
## Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
## Number of crossvalidation folds for lambda: 10
cva.glmnet
uses the algorithm described in the help for
cv.glmnet
, which is to fix the distribution of observations
across folds and then call cv.glmnet
in a loop with
different values of \(\alpha\).
Optionally, you can parallelise this outer loop, by setting the
outerParallel
argument to a non-NULL value. Currently,
glmnetUtils supports the following methods of parallelisation:
parLapply
in the parallel package. To use this, set
outerParallel
to a valid cluster object created by
makeCluster
.rxExec
as supplied by the (now retired) RevoScaleR
package from Microsoft R Server. To use this, set
outerParallel
to a valid compute context created by
RxComputeContext
, or a character string specifying such a
context.If the outer loop is run in parallel, cva.glmnet
can
check if the inner loop (over \(\lambda\)) is also set to run in parallel,
and disable this if it would lead to contention for cores.
Because crossvalidation is often a statistically noisy procedure, it doesn’t try to automatically choose \(\alpha\) and \(\lambda\) for you. Instead you can plot the output, to see how the results depend on the values of these parameters. Using this information, you can choose appropriate values for your data.
In this case, we see that values of \(\alpha\) close to \(1\) tend to lead to better accuracy. The
curves don’t have a well-defined minimum, but they do flatten out for
lower values of \(\lambda\). As the
cv.glmnet
documentation recommends though, it’s a good idea
to run cva.glmnet
multiple times to reduce the impact of
noise.
A cva.glmnet
object contains a list of individual
cv.glmnet
objects, corresponding to the different \(\alpha\) values tried. This lets you plot
the crossvalidation results easily for a given \(\alpha\):
The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it’s always under development; you can get the latest version from my GitHub repo. If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me at hongooi73@gmail.com.