Supervised Learning Tools for Deriving Biomarkers based on Single-Cell Data

Tingting Zhan

Inna Chervoneva

2025-06-27

Introduction

This vignette provides examples of using R packages

for deriving single index predictors of scalar outcomes based on spatial and non-spatial single-cell imaging data.

Prerequisite

Experimental (and maybe unstable) features are implemented extremely frequently on Github. Active developers should use the Github version; suggestions and bug reports are welcome!

remotes::install_github('tingtingzhan/groupedHyperframe')
remotes::install_github('tingtingzhan/hyper.gam')

Stable releases to CRAN are typically updated every 2 to 3 months, or when the authors have an upcoming manuscript in the peer-reviewing process. Developers should not use the CRAN version!

utils::install.packages('groupedHyperframe')
utils::install.packages('hyper.gam')

Dependencies

Package hyper.gam Imports packages

Package groupedHyperframe Imports packages

Since package spatstat.explore version 3.4-3.004, we have a Registered S3 method overwritten when loading packages groupedHyperframe and hyper.gam. The function name clash is between spatstat.explore::plot.roc() and pROC::plot.roc(). The package pROC (Robin et al. 2011) is among the Imports of package caret (Kuhn 2008).

Getting Started

Examples in this vignette require that the search path has

library(groupedHyperframe)
library(hyper.gam)
#> Registered S3 method overwritten by 'pROC':
#>   method   from            
#>   plot.roc spatstat.explore
library(survival)

Terms and Abbreviations

Term / Abbreviation Description
|> Forward pipe operator introduced since R 4.1.0, as well as the _ placeholder
:: Explicitly-namespaced function
attr, attributes Attributes
contour Contour line, https://en.wikipedia.org/wiki/Contour_line
createDataPartition Test vs. training data set partition, from package caret (Kuhn 2008)
csv, read.csv (Read) comma-separated-value files
coxph Cox proportional hazards model, from package survival (Therneau 2024)
gam Generalized additive models (GAM), from package mgcv (Wood 2017)
groupedHyperframe Grouped hyper data frame, from package groupedHyperframe (Zhan and Chervoneva 2025a)
hypercolumns, hyperframe (Hyper columns of) hyper data frame, from package spatstat.geom (Baddeley and Turner 2005)
htmlwidget HyperText Markup Language (HTML) widgets, from package htmlwidgets (Vaidyanathan et al. 2023), https://www.htmlwidgets.org, https://plotly.com/r/getting-started/
inherits Class inheritance
L-suffix Create integer constant
mgcv::s (Set up of) spline based smooths (Wood 2003)
mgcv::ti Tensor product interaction (Wood 2006)
persp Perspective plot, https://en.wikipedia.org/wiki/Perspective_(graphical)
PFS Progression/recurrence free survival, https://en.wikipedia.org/wiki/Progression-free_survival
predict Model predictions
predict.gam GAM model predictor
quantile Quantile
S3, generic, methods S3 object oriented system, UseMethod; getS3method; https://adv-r.hadley.nz/s3.html
search Search path for R objects
Surv Survival, i.e., time-to-event, object, from package survival (Therneau 2024)
symbol, name Names and symbols to refer to R objects

Acknowledgement

The authors thank Erjia Cui for his contribution to function hyper_gam().

This work is supported by National Institutes of Health, U.S. Department of Health and Human Services grants

Data Structure

Single-cell multiplex immuno-fluorescence immunohistochemistry (mIF-IHC) imaging data are the result of digital processing of the microscopic images of tissue stained with selected antibodies. Quantitative pathology platforms, e.g., Akoya or QuPath, support cell segmentation of mIF-IHC images and quantification of the mean protein expression in each cell. The cell centroid coordinates and cell signal intensities (CSIs) for each stained protein are usually extracted as individual comma-separated values .csv files. For each cell in a tissue image, the data include the cell centroid coordinates and cell signal intensity (CSI) for each quantified protein expression. The data may have multiple levels of hierarchical clustering. For example, single cells are clustered within a Region of Interest (ROI) or a tissue core, ROIs are clustered within a tissue or tissue cores are clustered within a patient.

Quantile Index

Applications based on the Quantile Index (QI) methodology is described in our peer-reviewed publications Yi et al. (2023a); Yi et al. (2023b); Yi et al. (2025).

Example

Data example Ki67 included in package groupedHyperframe (Github, CRAN) is a grouped hyper data frame, an extension of the hyper data frame hyperframe object defined in package spatstat.geom (Baddeley, Rubak, and Turner 2015; Baddeley and Turner 2005). The numeric-hypercolumn logKi67, whose elements are numeric vectors of different lengths, contains the log-transformed Ki67 protein expression CSIs in each tissueID nested in patientID. Such nested grouping structure is denoted by ~patientID/tissueID following the nomenclature of package nlme (J. C. Pinheiro and Bates 2000; J. Pinheiro, Bates, and R Core Team 2025). The data example Ki67 also contains the metadata including the outcome of interest, e.g., progression free survival PFS, Her2, HR, etc. Detailed information about the groupedHyperframe class may be found in package groupedHyperframe vignettes (RPubs, CRAN), section Grouped Hyper Data Frame.

data(Ki67, package = 'groupedHyperframe')
Ki67
#> Grouped Hyperframe: ~patientID/tissueID
#> 
#> 645 tissueID nested in
#> 622 patientID
#> 
#> Preview of first 10 (or less) rows:
#> 
#>      logKi67 tissueID Tstage  PFS recfreesurv_mon recurrence adj_rad adj_chemo
#> 1  (numeric) TJUe_I17      2 100+             100          0   FALSE     FALSE
#> 2  (numeric) TJUe_G17      1   22              22          1   FALSE     FALSE
#> 3  (numeric) TJUe_F17      1  99+              99          0   FALSE        NA
#> 4  (numeric) TJUe_D17      1  99+              99          0   FALSE      TRUE
#> 5  (numeric) TJUe_J18      1  112             112          1    TRUE      TRUE
#> 6  (numeric) TJUe_N17      4   12              12          1    TRUE     FALSE
#> 7  (numeric) TJUe_J17      2  64+              64          0   FALSE     FALSE
#> 8  (numeric) TJUe_F19      2  56+              56          0   FALSE     FALSE
#> 9  (numeric) TJUe_P19      2  79+              79          0   FALSE     FALSE
#> 10 (numeric) TJUe_O19      2   26              26          1   FALSE      TRUE
#>    histology  Her2   HR  node  race age patientID
#> 1          3  TRUE TRUE  TRUE White  66   PT00037
#> 2          3 FALSE TRUE FALSE Black  42   PT00039
#> 3          3 FALSE TRUE FALSE White  60   PT00040
#> 4          3 FALSE TRUE  TRUE White  53   PT00042
#> 5          3 FALSE TRUE  TRUE White  52   PT00054
#> 6          2  TRUE TRUE  TRUE Black  51   PT00059
#> 7          3 FALSE TRUE  TRUE Asian  50   PT00062
#> 8          2  TRUE TRUE  TRUE White  37   PT00068
#> 9          3  TRUE TRUE FALSE White  68   PT00082
#> 10         2  TRUE TRUE FALSE Black  55   PT00084

1️⃣ Compute Aggregated Quantiles

Function aggregate_quantile() first converts each element of the numeric-hypercolumn logKi67 into sample quantiles at a pre-specified grid of probabilities \{p_k, k=1,\cdots,K \} \in [0,1], then aggregates the quantiles of multiple tissueID’s per patientID by point-wise means (default of parameter f_aggr_). Note that the aggregation must be performed at the level of biologically independent clusters, e.g., ~patientID, to produce independent quantile predictors.

Ki67q = Ki67 |>
  aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01))

The returned object Ki67q is a hyper data frame hyperframe with a numeric-hypercolumn of aggregated sample quantiles logKi67.quantile per patientID. Users are encouraged to learn more about the function aggregate_quantile() from package groupedHyperframe vignettes (RPubs, CRAN), section Grouped Hyper Data Frame, subsection From data.frame.

Ki67q |> head()
#> Hyperframe:
#>   Tstage  PFS recfreesurv_mon recurrence adj_rad adj_chemo histology  Her2   HR
#> 1      2 100+             100          0   FALSE     FALSE         3  TRUE TRUE
#> 2      1   22              22          1   FALSE     FALSE         3 FALSE TRUE
#> 3      1  99+              99          0   FALSE        NA         3 FALSE TRUE
#> 4      1  99+              99          0   FALSE      TRUE         3 FALSE TRUE
#> 5      1  112             112          1    TRUE      TRUE         3 FALSE TRUE
#> 6      4   12              12          1    TRUE     FALSE         2  TRUE TRUE
#>    node  race age patientID logKi67.quantile
#> 1  TRUE White  66   PT00037        (numeric)
#> 2 FALSE Black  42   PT00039        (numeric)
#> 3 FALSE White  60   PT00040        (numeric)
#> 4  TRUE White  53   PT00042        (numeric)
#> 5  TRUE White  52   PT00054        (numeric)
#> 6  TRUE Black  51   PT00059        (numeric)

2️⃣ Estimate Integrand Surface

Linear quantile index (QI) (Equation 1) is a predictor in a functional generalized linear model (James 2002) for outcomes from the exponential family of distributions, or a linear functional Cox model (Gellar et al. 2015) for survival outcomes,

\text{QI}_{i}=\int_{0}^{1} \beta(p)Q_i(p)dp \tag{1}

where Q_i(p) is the (aggregated) sample quantiles logKi67.quantile for the i-th subject, and \beta(p) is the unknown coefficient function to be estimated. We use function hyper_gam() to fit a generalized additive model gam with integrated linear spline-based smoothness estimation (function mgcv::s(), Wood 2003). This is a scalar-on-function model (Reiss et al. 2017) that predicts a scalar outcome (e.g., progression free survival time PFS[,1L]) using the aggregated quantiles function as a functional predictor.

m0 = hyper_gam(PFS ~ logKi67.quantile, data = Ki67q)

Nonlinear quantile index (nlQI) (Equation 2) is a predictor in the functional generalized additive model (McLean et al. 2014) for outcomes from the exponential family of distributions, or an additive functional Cox model (Cui, Crainiceanu, and Leroux 2021) for survival outcomes.

\text{nlQI}_{i}= \int_{0}^{1} F\big(p, Q_i(p)\big)dp \tag{2}

where F(\cdot,\cdot) is an unknown bivariate twice differentiable function. We use function hyper_gam(., nonlinear = TRUE) to fit a generalized additive model gam with tensor product interaction estimation (function mgcv::ti(), Wood 2006).

m1 = hyper_gam(PFS ~ logKi67.quantile, data = Ki67q, nonlinear = TRUE)

S3 Class hyper_gam

The fitted functional model m0 and m1 have the S3 class hyper_gam, which inherits from the S3 class gam from package mgcv (Wood 2017). The hyper_gam class has an additional attribute,

Such inheritance enables the use of all S3 method dispatches on gam objects from package mgcv on the hyper_gam objects.

Visualization

Function integrandSurface() creates an interactive htmlwidget (Vaidyanathan et al. 2023) visualization of the estimated integrand surfaces for the linear (Equation 1) or nonlinear quantile index (Equation 2) using package plotly (Sievert 2020). The integrand surfaces, defined on p\in[0,1] and q\in\text{range}\big\{Q_i(p), i=1,\cdots,n\big\}, are

\hat{S}(p,q) = \begin{cases} \hat{\beta}(p)\cdot q\\ \hat{F}(p,q) \end{cases} \tag{3}

Also in this interactive visualization are

Figure 1 is an interactive htmlwidget visualization of the nonlinear integrand surface, integrand paths and their projections to the “floor” and “backwall”. Users should remove the argument n in integrandSurface(, n=101L), and use the default n=501L instead, for a more refined surface. We must use n=101L to reduce the htmlwidget object size, in order to comply with CRAN and/or RPubs file size limit. For the same reason, the interactive visualization of the linear integrand surface is suppressed in this vignette. Users are strongly encouraged to interact with it on their local device.

m0 |> integrandSurface() # please interact with it on your local computer 
m1 |> integrandSurface(n = 101L)
Figure 1: Nonlinear integrand surface, integrand paths and their projections to the “floor” and “backwall”

Static illustrations of the estimated integrand surfaces, e.g., the perspective (S3 method dispatch persp.hyper_gam()) and contour (S3 method dispatch contour.hyper_gam()) plots, are produced by calling the S3 generics persp() and contour() in package graphics shipped with vanilla R. These static figures are suppressed to reduce the file size of this vignette.

m0 |> persp() # a static figure
m0 |> contour() # a static figure
m1 |> persp() # a static figure
m1 |> contour() # a static figure

Visualization of the integrand surface (Equation 3) in functions integrandSurface(), persp.hyper_gam() and contour.hyper_gam() is inspired by function mgcv::vis.gam(). Visualization of the integrand paths, as well as their projections on the (p,q)- and (p,s)-plane, is an original idea and design by Tingting Zhan.

3️⃣ Compute Quantile Index Predictor

Linear and nonlinear quantile indices are the predictors in the functional models (Equation 1) and (Equation 2), respectively. Let’s consider a conventional scenario that we first fit a hyper_gam model to the training data set, then compute the quantile index predictors in the training and/or test data set using the training model.

First, we partition the 622 patients in hyper data frame Ki67q into a training data set with 498 patients and a test data set with 124 patients, i.e., a 80% vs. 20% partition.

set.seed(16); id = Ki67q |> nrow() |> seq_len() |> caret::createDataPartition(p = .8)
Ki67q_0 = Ki67q[id[[1L]],] # training set
Ki67q_1 = Ki67q[-id[[1L]],] # test set

Next, we fit a functional generalized additive model to the the training data set Ki67q_0,

m1a = hyper_gam(PFS ~ logKi67.quantile, nonlinear = TRUE, data = Ki67q_0)

The S3 method dispatch hyper.gam::predict.hyper_gam() calculates the quantile index predictors of the training and/or test data set, based on the training model m1a. The returned value is a numeric vector. This is a convenient wrapper and slight modification of the function mgcv::predict.gam(). The use of S3 generic stats::predict(), which is typically for predicted values, could be confusing, but we choose to follow the practice and nomenclature of function mgcv::predict.gam().

We can, but we should not, use the quantile index predictors of the training data set for downstream analysis, because these quantile index predictors are optimized on the training data set and the results would be optimistically biased.

Optimistically biased!!
Ki67q_0[,c('PFS', 'age', 'race')] |> 
  as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe()
  data.frame(nlQI = predict(m1a, newdata = Ki67q_0)) |>
  coxph(formula = PFS ~ age + nlQI, data = _)
#> Call:
#> coxph(formula = PFS ~ age + nlQI, data = data.frame(as.data.frame(Ki67q_0[, 
#>     c("PFS", "age", "race")]), nlQI = predict(m1a, newdata = Ki67q_0)))
#> 
#>           coef exp(coef)  se(coef)      z       p
#> age  -0.023320  0.976950  0.008321 -2.803 0.00507
#> nlQI  1.118713  3.060911  0.343152  3.260 0.00111
#> 
#> Likelihood ratio test=23.24  on 2 df, p=8.993e-06
#> n= 498, number of events= 99

Instead, we should use the quantile index predictors computed in the test data set for downstream analysis,

Ki67q_1[,c('PFS', 'age', 'race')] |> 
  as.data.frame() |> # invokes spatstat.geom::as.data.frame.hyperframe()
  data.frame(nlQI = predict(m1a, newdata = Ki67q_1)) |>
  coxph(formula = PFS ~ age + nlQI, data = _)
#> Call:
#> coxph(formula = PFS ~ age + nlQI, data = data.frame(as.data.frame(Ki67q_1[, 
#>     c("PFS", "age", "race")]), nlQI = predict(m1a, newdata = Ki67q_1)))
#> 
#>          coef exp(coef) se(coef)      z     p
#> age  -0.01636   0.98377  0.01862 -0.879 0.380
#> nlQI  1.21050   3.35516  0.75858  1.596 0.111
#> 
#> Likelihood ratio test=3.42  on 2 df, p=0.1805
#> n= 124, number of events= 19

Spatial Index

🚧 This section is under construction

References

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press. https://www.routledge.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/p/book/9781482210200/.
Baddeley, Adrian, and Rolf Turner. 2005. spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of Statistical Software 12 (6): 1–42. https://doi.org/10.18637/jss.v012.i06.
Bengtsson, Henrik. 2025. matrixStats: Functions That Apply to Rows and Columns of Matrices (and to Vectors). https://doi.org/10.32614/CRAN.package.matrixStats.
Borchers, Hans W. 2023. pracma: Practical Numerical Math Functions. https://doi.org/10.32614/CRAN.package.pracma.
Csárdi, Gábor. 2025. Cli: Helpers for Developing Command Line Interfaces. https://doi.org/10.32614/CRAN.package.cli.
Cui, Erjia, Ciprian M. Crainiceanu, and Andrew Leroux. 2021. “Additive Functional Cox Model.” Journal of Computational and Graphical Statistics 30 (3): 780–93. https://doi.org/10.1080/10618600.2020.1853550.
Gellar, Jonathan E., Elizabeth Colantuoni, Dale M. Needham, and Ciprian M. Crainiceanu. 2015. “Cox Regression Models with Functional Covariates for Survival Data.” Statistical Modelling 15 (3): 256–78. https://doi.org/10.1177/1471082X14565526.
James, Gareth M. 2002. “Generalized Linear Models with Functional Predictors.” Journal of the Royal Statistical Society Series B: Statistical Methodology 64 (3): 411–32. https://doi.org/10.1111/1467-9868.00342.
Kuhn, Max. 2008. “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.
McLean, Mathew W., Giles Hooker, Ana-Maria Staicu, Fabian Scheipl, and David Ruppert. 2014. “Functional Generalized Additive Models.” Journal of Computational and Graphical Statistics 23 (1): 249–69. https://doi.org/10.1080/10618600.2012.729985.
Pinheiro, José C., and Douglas M. Bates. 2000. Mixed-Effects Models in S and S-PLUS. New York: Springer. https://doi.org/10.1007/b98882.
Pinheiro, José, Douglas Bates, and R Core Team. 2025. nlme: Linear and Nonlinear Mixed Effects Models. https://doi.org/10.32614/CRAN.package.nlme.
Reiss, Philip T., Jeff Goldsmith, Han Lin Shang, and R. Todd Ogden. 2017. “Methods for Scalar-on-Function Regression.” International Statistical Review 85 (2): 228–49. https://doi.org/10.1111/insr.12163.
Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez, and Markus Müller. 2011. pROC: An Open-Source Package for R and S+ to Analyze and Compare ROC Curves.” BMC Bioinformatics 12: 77. https://doi.org/10.1186/1471-2105-12-77.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.
Therneau, Terry M. 2024. A Package for Survival Analysis in R. https://CRAN.R-project.org/package=survival.
Vaidyanathan, Ramnath, Yihui Xie, JJ Allaire, Joe Cheng, Carson Sievert, and Kenton Russell. 2023. htmlwidgets: HTML Widgets for R. https://doi.org/10.32614/CRAN.package.htmlwidgets.
Vallejos, R., F. Osorio, and M. Bevilacqua. 2020. Spatial Relationships Between Two Georeferenced Variables: With Applications in r. New York: Springer. http://srb2gv.mat.utfsm.cl/.
Wood, Simon N. 2003. “Thin-Plate Regression Splines.” Journal of the Royal Statistical Society B: Statistical Methodology 65 (1): 95–114. https://doi.org/10.1111/1467-9868.00374.
———. 2006. “Low-Rank Scale-Invariant Tensor Product Smooths for Generalized Additive Mixed Models.” Biometrics 62 (4): 1025–36. https://doi.org/10.1111/j.1541-0420.2006.00574.x.
———. 2017. Generalized Additive Models: An Introduction with R. 2nd ed. Chapman; Hall/CRC.
Yi, Misung, Tingting Zhan, Amy R. Peck, Jeffrey A. Hooke, Albert J. Kovatich, Craig D. Shriver, Hai Hu, Yunguang Sun, Hallgeir Rui, and Inna Chervoneva. 2023a. “Quantile Index Biomarkers Based on Single-Cell Expression Data.” Laboratory Investigation 103 (8): 100158. https://doi.org/10.1016/j.labinv.2023.100158.
———. 2023b. “Selection of Optimal Quantile Protein Biomarkers Based on Cell-Level Immunohistochemistry Data.” BMC Bioinformatics 24 (1): 298. https://doi.org/10.1186/s12859-023-05408-8.
Yi, Misung, Tingting Zhan, Hallgeir Rui, and Inna Chervoneva. 2025. “Functional Protein Biomarkers Based on Distributions of Expression Levels in Single-Cell Imaging Data.” Bioinformatics 41 (5): btaf182. https://doi.org/10.1093/bioinformatics/btaf182.
Zhan, Tingting, and Inna Chervoneva. 2025a. groupedHyperframe: Grouped Hyper Data Frame: An Extension of Hyper Data Frame Object. https://doi.org/10.32614/CRAN.package.groupedHyperframe.
———. 2025b. Hyper.gam: Generalized Additive Models with Hyper Column. https://doi.org/10.32614/CRAN.package.hyper.gam.