Type: | Package |
Title: | Active Learning for Process Monitoring |
Version: | 0.1.0 |
Description: | Implements the methodology introduced in Capezza, Lepore, and Paynabar (2025) <doi:10.1080/00401706.2025.2561744> for process monitoring with limited labeling resources. The package provides functions to (i) simulate data streams with true latent states and multivariate Gaussian observations as done in the paper, (ii) fit partially hidden Markov models (pHMMs) using a constrained Baum-Welch algorithm with partial labels, and (iii) perform stream-based active learning that balances exploration and exploitation to decide whether to request labels in real time. The methodology is particularly suited for statistical process monitoring in industrial applications where labeling is costly. |
Depends: | R (≥ 4.2) |
Imports: | Rcpp, Rfast, mvnfast, rrcov, caTools, abind, pROC, stats |
LinkingTo: | Rcpp, RcppArmadillo |
License: | GPL-3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.3 |
Suggests: | covr, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
NeedsCompilation: | yes |
Packaged: | 2025-09-30 12:31:47 UTC; christian.capezza |
Author: | Christian Capezza [aut, cre], Antonio Lepore [aut], Kamran Paynabar [aut] |
Maintainer: | Christian Capezza <christian.capezza@unina.it> |
Repository: | CRAN |
Date/Publication: | 2025-10-07 18:30:21 UTC |
ActiveLearning4SPM: Active Learning for Process Monitoring
Description
Implements the methodology introduced in Capezza, Lepore, and Paynabar (2025) doi:10.1080/00401706.2025.2561744 for process monitoring with limited labeling resources. The package provides functions to (i) simulate data streams with true latent states and multivariate Gaussian observations as done in the paper, (ii) fit partially hidden Markov models (pHMMs) using a constrained Baum-Welch algorithm with partial labels, and (iii) perform stream-based active learning that balances exploration and exploitation to decide whether to request labels in real time. The methodology is particularly suited for statistical process monitoring in industrial applications where labeling is costly.
Author(s)
Maintainer: Christian Capezza christian.capezza@unina.it
Authors:
Antonio Lepore
Kamran Paynabar
References
Capezza, C., Lepore, A., and Paynabar, K. (2025). Stream-Based Active Learning for Process Monitoring. Technometrics. <doi:10.1080/00401706.2025.2561744>
Stream-Based Active Learning with a Partially Hidden Markov Model (pHMM)
Description
Implements the stream-based active learning strategy of Capezza, Lepore, and Paynabar (2025) for process monitoring with partially observed states. At each time step, the method fits a pHMM to the available data, and balances between exploitation (reducing predictive uncertainty) and exploration (detecting potential out-of-control shifts) to decide whether to request the true label of the current observation. Labeling requests are constrained by a user-defined budget.
Usage
active_learning_pHMM(
y,
true_x,
T0,
B = 0.1,
weight_exploration = 0.5,
lambda_MEWMA = 0.3,
verbose = FALSE
)
Arguments
y |
A numeric matrix of dimension |
true_x |
Integer vector of true states of length |
T0 |
Integer. Number of initial observations assumed to be labeled as in-control (state 1). |
B |
Numeric between 0 and 1. Labeling budget, expressed as the maximum
fraction of observations (after the first |
weight_exploration |
Numeric between 0 and 1. Weight assigned to the
exploration criterion. The exploitation weight is computed as
|
lambda_MEWMA |
Numeric in (0,1). Smoothing parameter for the MEWMA
statistic used in the exploration criterion. Default is |
verbose |
Logical. If |
Details
The exploitation criterion is based on the entropy of the state sequence, while the exploration criterion uses a multivariate exponentially weighted moving average (MEWMA) statistic. The two criteria are combined with a user-defined weighting, and labeling stops when the budget is exhausted or at the end of the data stream.
Value
A list with components:
-
decision
: character vector indicating the action taken at each time ("label_exploitation"
,"label_exploration"
, or predicted state). -
xlabeled
: updated state sequence including acquired labels. -
xhat
: final predicted state sequence. -
scores
: classification performance metrics (accuracy, precision, recall, F1, AUC) computed against the true states.
Note
This function is intended for simulation studies where the entire
observation sequence y
and the corresponding true states x
are available in advance. The function uses these to evaluate the
active learning strategy under a given budgets. In real-time applications,
data and labels would arrive sequentially, and labels would only be obtained
if requested by the strategy.
References
Capezza, C., Lepore, A., & Paynabar, K. (2025). Stream-Based Active Learning for Process Monitoring. Technometrics. <doi:10.1080/00401706.2025.2561744>.
Examples
library(ActiveLearning4SPM)
set.seed(123)
dat <- simulate_stream(T0 = 50, TT = 100, T_min_IC = 20, T_max_IC = 30)
out <- active_learning_pHMM(y = dat$y,
true_x = dat$x,
T0 = 50,
B = 0.1)
table(out$decision)
out$scores$f1
Fit a Partially Hidden Markov Model (pHMM)
Description
Fits a partially hidden Markov model (pHMM) to multivariate time series
observations y
with partially observed process states x
, using a
constrained Baum-Welch algorithm. The function allows the user to provide
custom initial parameters, and supports constraints on known means and/or
covariances, as well as equal or diagonal covariance structures.
Usage
fit_pHMM(
y,
xlabeled,
nstates,
ppi_start = NULL,
A_start = NULL,
mean_start,
covariance_start = NULL,
known_mean = NULL,
known_covariance = NULL,
equal_covariance = FALSE,
covariance_structure = "full",
max_iter = 200,
tol = 0.001,
verbose = FALSE
)
Arguments
y |
A numeric matrix of dimension |
xlabeled |
An integer vector of length |
nstates |
Integer. The total number of hidden states to fit. |
ppi_start |
Numeric vector of length |
A_start |
Numeric |
mean_start |
List of length |
covariance_start |
List of covariance matrices for the emission
distributions. Must be of length |
known_mean |
Optional list of known mean vectors. Use |
known_covariance |
Optional list of known covariance matrices. Use
|
equal_covariance |
Logical. If |
covariance_structure |
Character string specifying the covariance
structure. Either |
max_iter |
Maximum number of EM iterations. Default is 200. |
tol |
Convergence tolerance for log-likelihood and parameter change.
Default is |
verbose |
Logical. If |
Value
A list with components:
-
y
,xlabeled
: the input data. -
log_lik
,log_lik_vec
: final and trace of log-likelihood. -
iter
: number of EM iterations performed. -
logB
,log_alpha
,log_beta
,log_gamma
,log_xi
: posterior quantities from the Baum-Welch algorithm. -
logAhat
,mean_hat
,covariance_hat
,log_pi_hat
: estimated model parameters. -
AIC
,BIC
: information criteria for model selection.
References
Capezza, C., Lepore, A., & Paynabar, K. (2025). Stream-Based Active Learning for Process Monitoring. Technometrics. <doi:10.1080/00401706.2025.2561744>.
Examples
library(ActiveLearning4SPM)
set.seed(123)
dat <- simulate_stream(T0 = 100, TT = 500)
y <- dat$y
xlabeled <- dat$x
d <- ncol(dat$y)
xlabeled[sample(1:600, 300)] <- NA
out <- fit_pHMM(y = y,
xlabeled = xlabeled,
nstates = 3,
mean_start = list(rep(0, d), rep(1, d), rep(-1, d)),
equal_covariance = TRUE)
out$AIC
Automatic Initialization and Fitting of a Partially Hidden Markov Model (pHMM)
Description
Fits a partially hidden Markov model (pHMM) to multivariate time series
observations y
with partially observed states x
, using the
constrained Baum-Welch algorithm. Unlike fit_pHMM
, this function
does not require user-specified initial parameters. Instead, it implements
a customized initialization strategy designed for process monitoring with
highly imbalanced classes, as described in the supplementary material of
Capezza, Lepore, and Paynabar (2025).
Usage
fit_pHMM_auto(
y = y,
xlabeled = xlabeled,
tol = 0.001,
max_nstates = 5,
ntry = 10
)
Arguments
y |
A numeric matrix of dimension |
xlabeled |
An integer vector of length |
tol |
Convergence tolerance for log-likelihood and parameter change.
Default is |
max_nstates |
Maximum number of hidden states to consider during the
initialization procedure. Default is |
ntry |
Number of candidate initializations for each new state. Default
is |
Details
The initialization procedure addresses the multimodality of the likelihood and the sensitivity of the Baum-Welch algorithm to starting values:
A one-state model (in-control process) is first fitted using robust estimators of location and scatter.
To introduce an additional state, candidate mean vectors are selected from observations that are least well represented by the current model. This is achieved by computing moving averages of the data over window lengths
k = 1, \ldots, 9
, and then calculating the Mahalanobis distances of these smoothed points to existing state means.The
ntry
observations with the largest minimum distances are retained as candidate initializations for the new state's mean.For each candidate, a pHMM is initialized with:
Existing means fixed to their previous estimates.
The new state's mean set to the candidate vector.
A shared covariance matrix fixed to the robust estimate from the in-control state.
Initial state distribution
\pi
concentrated on the IC state.Transition matrix with diagonal entries
1 - 0.01 (N-1)
and off-diagonal entries0.01
.
Each initialized model is fitted with the Baum-Welch algorithm, and the one achieving the highest log-likelihood is retained.
This process is repeated until up to
max_nstates
states are considered.
This strategy leverages prior process knowledge (dominant in-control regime) and focuses the search on under-represented regions of the data space, which improves convergence and reduces sensitivity to random initialization.
Value
A list with the same structure as returned by fit_pHMM
:
-
y
,xlabeled
: the input data. -
log_lik
,log_lik_vec
: final and trace of log-likelihood. -
iter
: number of EM iterations performed. -
logB
,log_alpha
,log_beta
,log_gamma
,log_xi
: posterior quantities from the Baum-Welch algorithm. -
logAhat
,mean_hat
,covariance_hat
,log_pi_hat
: estimated model parameters. -
AIC
,BIC
: information criteria for model selection.
References
Capezza, C., Lepore, A., & Paynabar, K. (2025). Stream-Based Active Learning for Process Monitoring. Technometrics. <doi:10.1080/00401706.2025.2561744>.
Supplementary Material, Section B: Initialization of the Partially Hidden Markov Model. Available at <https://doi.org/10.1080/00401706.2025.2561744>.
Examples
library(ActiveLearning4SPM)
set.seed(123)
dat <- simulate_stream(T0 = 100, TT = 500)
y <- dat$y
xlabeled <- dat$x
d <- ncol(dat$y)
xlabeled[sample(1:600, 300)] <- NA
obj <- fit_pHMM_auto(y = y,
xlabeled = xlabeled,
tol = 1e-3,
max_nstates = 5,
ntry = 10)
obj$AIC
Simulate Process Monitoring Data for Stream-Based Active Learning
Description
Generate a sequence of latent states and corresponding multivariate Gaussian observations for process monitoring. The process has three possible states:
state 1: in-control (IC),
state 2: out-of-control (OC),
state 3: out-of-control (OC).
Usage
simulate_stream(
d = 10,
TT = 500,
T0 = 100,
T_min_IC = 60,
T_max_IC = 85,
T_OC = 5,
mean = NULL,
covariance = NULL
)
Arguments
d |
Integer. Number of variables (dimension of the multivariate observations). Default is 10. |
TT |
Integer. Length of the sequence after the initial IC portion. Default is 500. |
T0 |
Integer. Length of the initial IC sequence known to belong to state 1. Default is 100. |
T_min_IC , T_max_IC |
Integers. Minimum and maximum length of consecutive IC observations before switching to an OC state. |
T_OC |
Integer. Fixed length of each OC state sequence. |
mean |
List of three numeric vectors of length |
covariance |
List of three |
Details
The first T0
observations are fixed in state 1 (IC). Then, in the
following TT
observations, only state 2 appears in the first half, and
only state 3 appears in the second half. Within each half, runs of state 1
(IC) of random length between T_min_IC
and T_max_IC
alternate
with fixed-length runs of the corresponding OC state of length T_OC
.
Value
A list with elements:
x |
Integer vector of latent states of length |
y |
Matrix of simulated multivariate observations with |
Examples
library(ActiveLearning4SPM)
sim <- simulate_stream()
table(sim$x)