::install_github('tingtingzhan/groupedHyperframe')
remotes::install_github('tingtingzhan/hyper.gam') remotes
2025-06-27
This vignette provides examples of using R
packages
groupedHyperframe
, Zhan and Chervoneva (2025a), CRAN
, Github, RPubshyper.gam
, Zhan and Chervoneva (2025b), CRAN
, Github, RPubsfor deriving single index predictors of scalar outcomes based on spatial and non-spatial single-cell imaging data.
Experimental (and maybe unstable) features are implemented extremely frequently on Github. Active developers should use the Github version; suggestions and bug reports are welcome!
::install_github('tingtingzhan/groupedHyperframe')
remotes::install_github('tingtingzhan/hyper.gam') remotes
Stable releases to CRAN
are typically updated every 2 to 3 months, or when the authors have an upcoming manuscript in the peer-reviewing process. Developers should not use the CRAN
version!
::install.packages('groupedHyperframe')
utils::install.packages('hyper.gam') utils
Package hyper.gam
Imports
packages
caret
(Kuhn 2008, version 7.0.1, key dependency), for k-fold prediction (to be documented in future)cli
(Csárdi 2025, version 3.6.5), for attractive command line interfaces (CLIs)mgcv
(Wood 2017, version 1.9.3, key dependency), for fitting generalized additive modelsnlme
(J. Pinheiro, Bates, and R Core Team 2025, version 3.1.168), to import the S3
generic function nlme::getData()
parallel
(shipped with vanilla R
, version 4.5.1), for parallel computing of k-fold prediction (to be documented in future)plotly
(Sievert 2020, version 4.11.0, key dependency), for interactive visualization using HTML widgetgroupedHyperframe
(Zhan and Chervoneva 2025a, version 0.2.4, key dependency), for data structurePackage groupedHyperframe
Imports
packages
cli
(Csárdi 2025, version 3.6.5)matrixStats
(Bengtsson 2025, version 1.5.0, key dependency), for matrix arithmeticparallel
(shipped with vanilla R
, version 4.5.1)pracma
(Borchers 2023, version 2.4.4, key dependency), for (cumulative) trapezoidal integrationspatstat.explore
(version 3.4.3.6) and spatstat.geom
(version 3.4.1.13), Baddeley, Rubak, and Turner (2015), key dependency, for spatial statisticsSpatialPack
(Vallejos, Osorio, and Bevilacqua 2020, version 0.4.1, key dependency), for Tjøstheim’s coefficient of spatial associationSince package spatstat.explore
version 3.4-3.004, we have a Registered S3 method overwritten
when loading packages groupedHyperframe
and hyper.gam
. The function name clash is between spatstat.explore::plot.roc()
and pROC::plot.roc()
. The package pROC
(Robin et al. 2011) is among the Imports
of package caret
(Kuhn 2008).
Examples in this vignette require that the search
path has
library(groupedHyperframe)
library(hyper.gam)
#> Registered S3 method overwritten by 'pROC':
#> method from
#> plot.roc spatstat.explore
library(survival)
Term / Abbreviation | Description |
---|---|
|> |
Forward pipe operator introduced since R 4.1.0, as well as the _ placeholder |
:: |
Explicitly-namespaced function |
attr , attributes |
Attributes |
contour |
Contour line, https://en.wikipedia.org/wiki/Contour_line |
createDataPartition |
Test vs. training data set partition, from package caret (Kuhn 2008) |
csv , read.csv |
(Read) comma-separated-value files |
coxph |
Cox proportional hazards model, from package survival (Therneau 2024) |
gam |
Generalized additive models (GAM), from package mgcv (Wood 2017) |
groupedHyperframe |
Grouped hyper data frame, from package groupedHyperframe (Zhan and Chervoneva 2025a) |
hypercolumns , hyperframe |
(Hyper columns of) hyper data frame, from package spatstat.geom (Baddeley and Turner 2005) |
htmlwidget |
HyperText Markup Language (HTML) widgets, from package htmlwidgets (Vaidyanathan et al. 2023), https://www.htmlwidgets.org, https://plotly.com/r/getting-started/ |
inherits |
Class inheritance |
L -suffix |
Create integer constant |
mgcv::s |
(Set up of) spline based smooths (Wood 2003) |
mgcv::ti |
Tensor product interaction (Wood 2006) |
persp |
Perspective plot, https://en.wikipedia.org/wiki/Perspective_(graphical) |
PFS |
Progression/recurrence free survival, https://en.wikipedia.org/wiki/Progression-free_survival |
predict |
Model predictions |
predict.gam |
GAM model predictor |
quantile |
Quantile |
S3 , generic , methods |
S3 object oriented system, UseMethod ; getS3method ; https://adv-r.hadley.nz/s3.html |
search |
Search path for R objects |
Surv |
Survival, i.e., time-to-event, object, from package survival (Therneau 2024) |
symbol , name |
Names and symbols to refer to R objects |
The authors thank Erjia Cui for his contribution to function hyper_gam()
.
This work is supported by National Institutes of Health, U.S. Department of Health and Human Services grants
R01CA222847 (I. Chervoneva, T. Zhan, and H. Rui)
R01CA253977 (H. Rui and I. Chervoneva).
Single-cell multiplex immuno-fluorescence immunohistochemistry (mIF-IHC) imaging data are the result of digital processing of the microscopic images of tissue stained with selected antibodies. Quantitative pathology platforms, e.g., Akoya or QuPath, support cell segmentation of mIF-IHC images and quantification of the mean protein expression in each cell. The cell centroid coordinates and cell signal intensities (CSIs) for each stained protein are usually extracted as individual comma-separated values .csv
files. For each cell in a tissue image, the data include the cell centroid coordinates and cell signal intensity (CSI) for each quantified protein expression. The data may have multiple levels of hierarchical clustering. For example, single cells are clustered within a Region of Interest (ROI) or a tissue core, ROIs are clustered within a tissue or tissue cores are clustered within a patient.
Applications based on the Quantile Index (QI) methodology is described in our peer-reviewed publications Yi et al. (2023a); Yi et al. (2023b); Yi et al. (2025).
Data example Ki67
included in package groupedHyperframe
(Github, CRAN
) is a grouped hyper data frame, an extension of the hyper data frame hyperframe
object defined in package spatstat.geom
(Baddeley, Rubak, and Turner 2015; Baddeley and Turner 2005). The numeric-hypercolumn logKi67
, whose elements are numeric vectors of different lengths, contains the log-transformed Ki67 protein expression CSIs in each tissueID
nested in patientID
. Such nested grouping structure is denoted by ~patientID/tissueID
following the nomenclature of package nlme
(J. C. Pinheiro and Bates 2000; J. Pinheiro, Bates, and R Core Team 2025). The data example Ki67
also contains the metadata including the outcome of interest, e.g., progression free survival PFS
, Her2
, HR
, etc. Detailed information about the groupedHyperframe
class may be found in package groupedHyperframe
vignettes (RPubs, CRAN
), section Grouped Hyper Data Frame.
data(Ki67, package = 'groupedHyperframe')
Ki67#> Grouped Hyperframe: ~patientID/tissueID
#>
#> 645 tissueID nested in
#> 622 patientID
#>
#> Preview of first 10 (or less) rows:
#>
#> logKi67 tissueID Tstage PFS recfreesurv_mon recurrence adj_rad adj_chemo
#> 1 (numeric) TJUe_I17 2 100+ 100 0 FALSE FALSE
#> 2 (numeric) TJUe_G17 1 22 22 1 FALSE FALSE
#> 3 (numeric) TJUe_F17 1 99+ 99 0 FALSE NA
#> 4 (numeric) TJUe_D17 1 99+ 99 0 FALSE TRUE
#> 5 (numeric) TJUe_J18 1 112 112 1 TRUE TRUE
#> 6 (numeric) TJUe_N17 4 12 12 1 TRUE FALSE
#> 7 (numeric) TJUe_J17 2 64+ 64 0 FALSE FALSE
#> 8 (numeric) TJUe_F19 2 56+ 56 0 FALSE FALSE
#> 9 (numeric) TJUe_P19 2 79+ 79 0 FALSE FALSE
#> 10 (numeric) TJUe_O19 2 26 26 1 FALSE TRUE
#> histology Her2 HR node race age patientID
#> 1 3 TRUE TRUE TRUE White 66 PT00037
#> 2 3 FALSE TRUE FALSE Black 42 PT00039
#> 3 3 FALSE TRUE FALSE White 60 PT00040
#> 4 3 FALSE TRUE TRUE White 53 PT00042
#> 5 3 FALSE TRUE TRUE White 52 PT00054
#> 6 2 TRUE TRUE TRUE Black 51 PT00059
#> 7 3 FALSE TRUE TRUE Asian 50 PT00062
#> 8 2 TRUE TRUE TRUE White 37 PT00068
#> 9 3 TRUE TRUE FALSE White 68 PT00082
#> 10 2 TRUE TRUE FALSE Black 55 PT00084
Function aggregate_quantile()
first converts each element of the numeric-hypercolumn logKi67
into sample quantile
s at a pre-specified grid of prob
abilities
\{p_k, k=1,\cdots,K \} \in [0,1], then aggregates the quantiles of multiple tissueID
’s per patientID
by point-wise means (default of parameter f_aggr_
). Note that the aggregation must be performed at the level of biologically independent clusters, e.g., ~patientID
, to produce independent quantile predictors.
= Ki67 |>
Ki67q aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01))
The returned object Ki67q
is a hyper data frame hyperframe
with a numeric-hypercolumn of aggregated sample quantiles logKi67.quantile
per patientID
. Users are encouraged to learn more about the function aggregate_quantile()
from package groupedHyperframe
vignettes (RPubs, CRAN
), section Grouped Hyper Data Frame, subsection From data.frame
.
|> head()
Ki67q #> Hyperframe:
#> Tstage PFS recfreesurv_mon recurrence adj_rad adj_chemo histology Her2 HR
#> 1 2 100+ 100 0 FALSE FALSE 3 TRUE TRUE
#> 2 1 22 22 1 FALSE FALSE 3 FALSE TRUE
#> 3 1 99+ 99 0 FALSE NA 3 FALSE TRUE
#> 4 1 99+ 99 0 FALSE TRUE 3 FALSE TRUE
#> 5 1 112 112 1 TRUE TRUE 3 FALSE TRUE
#> 6 4 12 12 1 TRUE FALSE 2 TRUE TRUE
#> node race age patientID logKi67.quantile
#> 1 TRUE White 66 PT00037 (numeric)
#> 2 FALSE Black 42 PT00039 (numeric)
#> 3 FALSE White 60 PT00040 (numeric)
#> 4 TRUE White 53 PT00042 (numeric)
#> 5 TRUE White 52 PT00054 (numeric)
#> 6 TRUE Black 51 PT00059 (numeric)
Linear quantile index (QI) (Equation 1) is a predictor in a functional generalized linear model (James 2002) for outcomes from the exponential family of distributions, or a linear functional Cox model (Gellar et al. 2015) for survival outcomes,
\text{QI}_{i}=\int_{0}^{1} \beta(p)Q_i(p)dp \tag{1}
where Q_i(p) is the (aggregated) sample quantiles logKi67.quantile
for the i-th subject, and \beta(p) is the unknown coefficient function to be estimated. We use function hyper_gam()
to fit a generalized additive model gam
with integrated linear spline-based smoothness estimation (function mgcv::s()
, Wood 2003). This is a scalar-on-function model (Reiss et al. 2017) that predicts a scalar outcome (e.g., progression free survival time PFS[,1L]
) using the aggregated quantiles function as a functional predictor.
= hyper_gam(PFS ~ logKi67.quantile, data = Ki67q) m0
Nonlinear quantile index (nlQI) (Equation 2) is a predictor in the functional generalized additive model (McLean et al. 2014) for outcomes from the exponential family of distributions, or an additive functional Cox model (Cui, Crainiceanu, and Leroux 2021) for survival outcomes.
\text{nlQI}_{i}= \int_{0}^{1} F\big(p, Q_i(p)\big)dp \tag{2}
where F(\cdot,\cdot) is an unknown bivariate twice differentiable function. We use function hyper_gam(., nonlinear = TRUE)
to fit a generalized additive model gam
with tensor product interaction estimation (function mgcv::ti()
, Wood 2006).
= hyper_gam(PFS ~ logKi67.quantile, data = Ki67q, nonlinear = TRUE) m1
S3
Class hyper_gam
The fitted functional model m0
and m1
have the S3
class hyper_gam
, which inherits from the S3
class gam
from package mgcv
(Wood 2017). The hyper_gam
class has an additional attribute,
attr(., 'xname')
, a symbol
of the hypercolumn name.Such inheritance enables the use of all S3
method dispatches on gam
objects from package mgcv
on the hyper_gam
objects.
Function integrandSurface()
creates an interactive htmlwidget
(Vaidyanathan et al. 2023) visualization of the estimated integrand surfaces for the linear (Equation 1) or nonlinear quantile index (Equation 2) using package plotly
(Sievert 2020). The integrand surfaces, defined on p\in[0,1] and q\in\text{range}\big\{Q_i(p), i=1,\cdots,n\big\}, are
\hat{S}(p,q) = \begin{cases} \hat{\beta}(p)\cdot q\\ \hat{F}(p,q) \end{cases} \tag{3}
Also in this interactive visualization are
the contour lines on the integrand surfaces (Equation 3), as well as their projections along the s-axis, i.e., onto the (p,q)-plane (a.k.a., the “floor”);
the estimated linear integrand paths \hat{\beta}(p)Q_i(p) or the nonlinear integrand paths \hat{F}(p, Q_i(p)) on the integrand surfaces (Equation 3);
the sample quantiles Q_i(p), i.e., the projections of the estimated linear or nonlinear integrand path along the s-axis, i.e., onto the (p,q)-plane (a.k.a., the “floor”);
the projections of the estimated linear or nonlinear integrand path along the q-axis, i.e., onto the (p,s)-plane (a.k.a., the “backwall”), so that the area under each projected path is equal to the estimated linear (Equation 1) or nonlinear quantile index (Equation 2).
Figure 1 is an interactive htmlwidget
visualization of the nonlinear integrand surface, integrand paths and their projections to the “floor” and “backwall”. Users should remove the argument n
in integrandSurface(, n=101L)
, and use the default n=501L
instead, for a more refined surface. We must use n=101L
to reduce the htmlwidget
object size, in order to comply with CRAN
and/or RPubs file size limit. For the same reason, the interactive visualization of the linear integrand surface is suppressed in this vignette. Users are strongly encouraged to interact with it on their local device.
|> integrandSurface() # please interact with it on your local computer m0
|> integrandSurface(n = 101L) m1
Static illustrations of the estimated integrand surfaces, e.g., the persp
ective (S3
method dispatch persp.hyper_gam()
) and contour
(S3
method dispatch contour.hyper_gam()
) plots, are produced by calling the S3
generics persp()
and contour()
in package graphics
shipped with vanilla R
. These static figures are suppressed to reduce the file size of this vignette.
|> persp() # a static figure m0
|> contour() # a static figure m0
|> persp() # a static figure m1
|> contour() # a static figure m1
Visualization of the integrand surface (Equation 3) in functions integrandSurface()
, persp.hyper_gam()
and contour.hyper_gam()
is inspired by function mgcv::vis.gam()
. Visualization of the integrand paths, as well as their projections on the (p,q)- and (p,s)-plane, is an original idea and design by Tingting Zhan.
Linear and nonlinear quantile indices are the predictors in the functional models (Equation 1) and (Equation 2), respectively. Let’s consider a conventional scenario that we first fit a hyper_gam
model to the training data set, then compute the quantile index predictors in the training and/or test data set using the training model.
First, we partition the 622 patients in hyper data frame Ki67q
into a training data set with 498 patients and a test data set with 124 patients, i.e., a 80% vs. 20% partition.
set.seed(16); id = Ki67q |> nrow() |> seq_len() |> caret::createDataPartition(p = .8)
= Ki67q[id[[1L]],] # training set
Ki67q_0 = Ki67q[-id[[1L]],] # test set Ki67q_1
Next, we fit a functional generalized additive model to the the training data set Ki67q_0
,
= hyper_gam(PFS ~ logKi67.quantile, nonlinear = TRUE, data = Ki67q_0) m1a
The S3
method dispatch hyper.gam::predict.hyper_gam()
calculates the quantile index predictors of the training and/or test data set, based on the training model m1a
. The returned value is a numeric
vector
. This is a convenient wrapper and slight modification of the function mgcv::predict.gam()
. The use of S3
generic stats::predict()
, which is typically for predicted values, could be confusing, but we choose to follow the practice and nomenclature of function mgcv::predict.gam()
.
We can, but we should not, use the quantile index predictors of the training data set for downstream analysis, because these quantile index predictors are optimized on the training data set and the results would be optimistically biased.
c('PFS', 'age', 'race')] |>
Ki67q_0[,as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe()
data.frame(nlQI = predict(m1a, newdata = Ki67q_0)) |>
coxph(formula = PFS ~ age + nlQI, data = _)
#> Call:
#> coxph(formula = PFS ~ age + nlQI, data = data.frame(as.data.frame(Ki67q_0[,
#> c("PFS", "age", "race")]), nlQI = predict(m1a, newdata = Ki67q_0)))
#>
#> coef exp(coef) se(coef) z p
#> age -0.023320 0.976950 0.008321 -2.803 0.00507
#> nlQI 1.118713 3.060911 0.343152 3.260 0.00111
#>
#> Likelihood ratio test=23.24 on 2 df, p=8.993e-06
#> n= 498, number of events= 99
Instead, we should use the quantile index predictors computed in the test data set for downstream analysis,
c('PFS', 'age', 'race')] |>
Ki67q_1[,as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe()
data.frame(nlQI = predict(m1a, newdata = Ki67q_1)) |>
coxph(formula = PFS ~ age + nlQI, data = _)
#> Call:
#> coxph(formula = PFS ~ age + nlQI, data = data.frame(as.data.frame(Ki67q_1[,
#> c("PFS", "age", "race")]), nlQI = predict(m1a, newdata = Ki67q_1)))
#>
#> coef exp(coef) se(coef) z p
#> age -0.01636 0.98377 0.01862 -0.879 0.380
#> nlQI 1.21050 3.35516 0.75858 1.596 0.111
#>
#> Likelihood ratio test=3.42 on 2 df, p=0.1805
#> n= 124, number of events= 19
🚧 This section is under construction