Title: Smooth L0 Penalty Approximations for Gaussian Graphical Models
Version: 0.0.1
Date: 2026-03-21
Description: Provides smooth approximations to the L0 norm penalty for estimating sparse Gaussian graphical models (GGMs). Network estimation is performed using the Local Linear Approximation (LLA) framework (Fan & Li, 2001 <doi:10.1198/016214501753382273>; Zou & Li, 2008 <doi:10.1214/009053607000000802>) with five penalty functions: arctangent (Wang & Zhu, 2016 <doi:10.1155/2016/6495417>), EXP (Wang, Fan, & Zhu, 2018 <doi:10.1007/s10463-016-0588-3>), Gumbel, Log (Candes, Wakin, & Boyd, 2008 <doi:10.1007/s00041-008-9045-x>), and Weibull. Adaptive penalty parameters for EXP, Gumbel, and Weibull are estimated via maximum likelihood, and model selection uses information criteria including AIC, BIC, and EBIC (Extended BIC). Simulation functions generate multivariate normal data from GGMs with stochastic block model or small-world (Watts-Strogatz) network structures.
Depends: R (≥ 3.5.0)
License: AGPL (≥ 3.0)
Encoding: UTF-8
LazyData: true
NeedsCompilation: yes
Imports: igraph, glasso, glassoFast, Matrix, methods, psych, stats
Copyright: See inst/COPYRIGHTS for details
BugReports: https://github.com/AlexChristensen/L0ggm/issues
RoxygenNote: 7.3.3
Packaged: 2026-03-21 16:01:20 UTC; alextops
Author: Alexander Christensen ORCID iD [aut, cre], Jeongwon Choi ORCID iD [ctb], John Fox [cph, ctb] (Original implementation of polyserial correlations in auto_correlate.R), Yves Rosseel [cph, ctb] (Original implementation of rmsea_ci in network_fit.R), Alexander Robitzsch [cph, ctb] (C++ implementation of Drezner-Wesolowsky bivariate normal CDF in polychoric_matrix.c), David Blackman [ctb] (Original xoshiro.c implementation), Sebastiano Vigna [ctb] (Original xoshiro.c implementation), John Burkardt [cph, ctb] (Original ziggurat.c implementation)
Maintainer: Alexander Christensen <alexpaulchristensen@gmail.com>
Repository: CRAN
Date/Publication: 2026-03-26 09:30:27 UTC

L0ggm-package

Description

Implements L0 norm regularization penalties for Gaussian graphical models

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>

See Also

Useful links:


Automatic correlations

Description

Automatically computes the proper correlations between continuous and categorical variables. NA values are not treated as categories

Usage

auto_correlate(
  data,
  corr = c("kendall", "pearson", "spearman"),
  ordinal_categories = 7,
  forcePD = TRUE,
  na_data = c("pairwise", "listwise"),
  empty_method = c("none", "zero", "all"),
  empty_value = c("none", "point_five", "one_over"),
  forceReturn = FALSE,
  verbose = FALSE,
  ...
)

Arguments

data

Matrix or data frame. Should consist only of variables to be used in the analysis

corr

Character (length = 1). The standard correlation method to be used. Defaults to "pearson". Using "pearson" will compute polychoric, tetrachoric, polyserial, and biserial correlations for categorical and categorical/continuous correlations by default. To obtain "pearson" correlations regardless, use cor. Other options of "kendall" and "spearman" are provided for completeness and use cor

ordinal_categories

Numeric (length = 1). Up to the number of categories before a variable is considered continuous. Defaults to 7 categories before 8 is considered continuous

forcePD

Boolean (length = 1). Whether positive definite matrix should be enforced. Defaults to TRUE

na_data

Character (length = 1). How should missing data be handled? Defaults to "pairwise". Available options:

  • "pairwise" — Computes correlation for all available cases between two variables

  • "listwise" — Computes correlation for all complete cases in the dataset

empty_method

Character (length = 1). Method for empty cell correction in polychoric_matrix. Defaults to "none" Available options:

  • "none" — Adds no value (empty_value = "none") to the empirical joint frequency table between two variables

  • "zero" — Adds empty_value to the cells with zero in the joint frequency table between two variables

  • "all" — Adds empty_value to all in the joint frequency table between two variables

empty_value

Character (length = 1). Value to add to the joint frequency table cells in polychoric_matrix. Defaults to "none". Accepts numeric values between 0 and 1 or specific methods:

  • "none" — Adds no value (0) to the empirical joint frequency table between two variables

  • "point_five" — Adds 0.5 to the cells defined by empty_method

  • "one_over" — Adds 1 / n where n equals the number of cells based on empty_method. For empty_method = "zero", n equals the number of zero cells

forceReturn

Boolean (length = 1). Whether correlation matrix should be forced to return. Defaults to FALSE. Set to TRUE to receive the correlation matrix "as is"

verbose

Boolean (length = 1). Whether messages should be printed. Defaults to FALSE

...

Not actually used but makes it easier for general functionality in the package

Value

A symmetric numeric matrix of dimension p \times p, where p is the number of variables in data, with values in [-1, 1] and ones on the diagonal. The correlation method used for each pair of variables depends on their types when corr = "pearson" (the default): polychoric for ordinal-ordinal pairs, polyserial for ordinal-continuous pairs, and Pearson's for continuous-continuous pairs, where ordinal means having at most ordinal_categories unique values. When corr is "kendall" or "spearman", cor is used for all pairs regardless of variable type. Row and column names are inherited from data. When forcePD = TRUE (default) and the resulting matrix is not positive definite, the nearest positive definite matrix is returned via nearPD.

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>

Examples

# Obtain correlations
R <- auto_correlate(basic_smallworld)


Toy Small-world Network Data Example

Description

A toy small-world network example (n = 500)

Usage

data(basic_smallworld)

Format

A 500 \times 20 continuous response matrix

Examples

# Generated using:
# basic_smallworld <- simulate_smallworld(
#  nodes = 20, # 20 nodes in the network
#  density = 0.30, # moderate initial lattice connectivity
#  rewire = 0.20, # 20% rewiring probability
#  sample_size = 500 # number of cases = 500
# )$data
data("basic_smallworld")


Categorize Continuous Data

Description

Categorizes continuous data based on Garrido, Abad and Ponsoda (2011; see references). Categorical data with 2 to 6 categories can include skew between -2 to 2 in increments of 0.05

Usage

categorize(data, categories, skew_value = 0)

Arguments

data

Numeric (length = n). A vector of continuous data with n values. For matrices, use apply

categories

Numeric (length = 1). Number of categories to create. Between 2 and 6 categories can be used with skew

skew_value

Numeric (length = 1). Value of skew. Ranges between -2 to 2 in increments of 0.05. Skews not in this sequence will be converted to the nearest value in this sequence. Defaults to 0 or no skew

Value

An integer vector of length n with values from 1 to categories, giving the category assignment for each observation. Category 1 corresponds to the lowest values of data and category categories to the highest. When categories > 6, cut is used and skew is not applied.

Author(s)

Maria Dolores Nieto Canaveras <mnietoca@nebrija.es>, Luis Eduardo Garrido <luisgarrido@pucmm.edu>, Hudson Golino <hfg9s@virginia.edu>, Alexander P. Christensen <alexpaulchristensen@gmail.com>

References

Garrido, L. E., Abad, F. J., & Ponsoda, V. (2011).
Performance of Velicer’s minimum average partial factor retention method with categorical variables.
Educational and Psychological Measurement, 71(3), 551-570.

Golino, H., Shi, D., Christensen, A. P., Garrido, L. E., Nieto, M. D., Sadana, R., ... & Martinez-Molina, A. (2020). Investigating the performance of exploratory graph analysis and traditional techniques to identify the number of latent factors: A simulation and tutorial. Psychological Methods, 25(3), 292-320.

Examples

# Dichotomous data (no skew)
dichotomous <- categorize(
  data = rnorm(1000),
  categories = 2
)

# Dichotomous data (with positive skew)
dichotomous_skew <- categorize(
  data = rnorm(1000),
  categories = 2,
  skew_value = 1.25
)

# 5-point Likert scale (no skew)
five_likert <- categorize(
  data = rnorm(1000),
  categories = 5
)

# 5-point Likert scale (negative skew)
five_likert <- categorize(
  data = rnorm(1000),
  categories = 5,
  skew_value = -0.45
)


Confusion Matrix Metrics for Edge Comparison and Recovery

Description

Computes many commonly used confusion matrix metrics

Usage

edge_confusion(
  base,
  comparison,
  metric = c("all", "sen", "spec", "ppv", "npv", "fdr", "fom", "ba", "f1", "csi", "mcc"),
  full.names = FALSE
)

Arguments

base

Matrix or data frame. Network that will be treated as the "ground truth" such that a false positive represents an edge that is present in comparison but not in this network

comparison

Matrix or data frame. Network that will be treated as the estimator such that a false positive represents an edge that is present in this network but not in base

metric

Character vector. Defaults to "all" metrics. Available options:

  • "all" — All available metrics (default)

  • "sen" — Sensitivity (True Positive Rate):

    \frac{TP}{TP + FN}

  • "spec" — Specificity (True Negative Rate):

    \frac{TN}{TN + FP}

  • "ppv" — Positive Predictive Value (Precision):

    \frac{TP}{TP + FP}

  • "npv" — Negative Predictive Value:

    \frac{TN}{TN + FN}

  • "fdr" — False Discovery Rate:

    1 - PPV = \frac{FP}{TP + FP}

  • "fom" — False Omission Rate:

    1 - NPV = \frac{FN}{TN + FN}

  • "ba" — Balanced Accuracy:

    \frac{Sensitivity + Specificity}{2}

  • "f1" — F1 Score (harmonic mean of PPV and Sensitivity):

    \frac{2TP}{2TP + FP + FN}

  • "csi" — Critical Success Index (Jaccard / Threat Score):

    \frac{TP}{TP + FP + FN}

  • "mcc" — Matthews Correlation Coefficient:

    \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

full.names

Boolean (length = 1). Whether full or abbreviated names should be used. Defaults to FALSE. Set to TRUE for full names

Value

A named numeric vector of the requested confusion matrix metrics. Values are generally in [0, 1], with the exception of "mcc" (Matthews Correlation Coefficient), which ranges from -1 to 1 where 1 indicates perfect agreement, 0 indicates chance-level performance, and -1 indicates perfect disagreement. The vector contains only the elements specified by metric (all ten metrics when metric = "all", which is the default). Names are abbreviated (e.g., "sen") when full.names = FALSE (default), or expanded (e.g., "Sensitivity") when full.names = TRUE. Any metric whose denominator is zero (e.g., sensitivity when no edges exist in base) is returned as NA.

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>

Examples

# Set split
split <- sample(
  1:nrow(basic_smallworld),
  round(nrow(basic_smallworld) / 2)
)

# Estimate networks
split1 <- network_estimation(basic_smallworld[split,])
split2 <- network_estimation(basic_smallworld[-split,])

# Estimate metrics
edge_confusion(split1, split2)


L0 Norm Regularized Network Estimation

Description

A general function to estimate Gaussian graphical models using L0 penalty approximations. All penalties are implemented using either single-pass or full Local Linear Approximation (LLA: Fan & Li, 2001; Zou & Li, 2008)

Usage

network_estimation(
  data,
  n = NULL,
  corr = c("auto", "pearson", "spearman"),
  na_data = c("pairwise", "listwise"),
  penalty = c("atan", "exp", "gumbel", "log", "weibull"),
  gamma = NULL,
  adaptive = TRUE,
  nlambda = 50,
  lambda_min_ratio = 0.01,
  penalize_diagonal = TRUE,
  ic = c("AIC", "AICc", "BIC", "BIC0", "EBIC", "MBIC"),
  ebic_gamma = 0.5,
  fast = TRUE,
  LLA = FALSE,
  LLA_threshold = 0.001,
  LLA_iter = 10000,
  network_only = TRUE,
  verbose = FALSE,
  ...
)

Arguments

data

Matrix or data frame. Should consist only of variables to be used in the analysis

n

Numeric (length = 1). Sample size must be provided if data provided is a correlation matrix

corr

Character (length = 1). Method to compute correlations. Defaults to "auto". Available options:

  • "auto" — Automatically computes appropriate correlations for the data using Pearson's for continuous, polychoric for ordinal, tetrachoric for binary, and polyserial/biserial for ordinal/binary with continuous. To change the number of categories that are considered ordinal, use ordinal_categories (see polychoric_matrix for more details)

  • "pearson" — Pearson's correlation is computed for all variables regardless of categories

  • "spearman" — Spearman's rank-order correlation is computed for all variables regardless of categories

For other similarity measures, compute them first and input them into data with the sample size (n)

na_data

Character (length = 1). How should missing data be handled? Defaults to "pairwise". Available options:

  • "pairwise" — Computes correlation for all available cases between two variables

  • "listwise" — Computes correlation for all complete cases in the dataset

penalty

Character (length = 1). Available options:

  • "atan" — Arctangent (Wang & Zhu, 2016)

    \lambda \cdot (\gamma + \frac{2}{\pi}) \cdot \arctan\left(\frac{|x|}{\gamma}\right)

  • "exp" — EXP (Wang, Fan, & Zhu, 2018)

    \lambda \cdot (1 - e^{-\frac{|x|}{\gamma}})

  • "gumbel" — Gumbel

    \frac{\lambda}{1 - e^{-1}} \cdot \left(e^{-e^{-\frac{|x|}{\gamma}}} - e^{-1}\right)

  • "log" — Log (Candes, Wakin, & Boyd, 2008)

    \frac{\lambda \cdot \log\left(1 + \frac{|x|}{\gamma}\right)}{\log\left(1 + \frac{1}{\gamma}\right)}

  • "weibull" — Weibull

    \lambda \cdot \left(1 - e^{-\left(\frac{|x|}{\gamma}\right)^k}\right)

gamma

Numeric (length = 1). Adjusts the shape of the penalty. Defaults:

  • "atan" = 0.01

  • "exp" = 0.01

  • "gumbel" = 0.01

  • "log" = 0.10

  • "weibull" = 0.01

adaptive

Boolean (length = 1). Whether data-adaptive (gamma) parameters should be used. Defaults to TRUE. Set to FALSE to apply default gamma parameters for adaptive penalties. Available options:

  • "exp"

  • "gumbel"

  • "weibull"

All adaptive penalties use the 95% confidence interval from zero

nlambda

Numeric (length = 1). Number of lambda values to test. Defaults to 50

lambda_min_ratio

Numeric (length = 1). Ratio of lowest lambda value compared to maximal lambda. Defaults to 0.01

penalize_diagonal

Boolean (length = 1). Should the diagonal be penalized? Defaults to TRUE

ic

Character (length = 1). What information criterion should be used for model selection? Available options include:

  • "AIC" — Akaike's information criterion: -2L + 2E

  • "AICc" — AIC corrected: AIC + \frac{2E^2 + 2E}{n - E - 1}

  • "BIC" — Bayesian information criterion: -2L + E \cdot \log{(n)}

  • "BIC0" — Bayesian information criterion not (Dicker et al., 2013): \log{\large(\frac{D}{n - E}\large)} + \large(\frac{\log{(n)}}{n}\large) \cdot E

  • "EBIC" — Extended BIC: BIC + 4E \cdot \gamma \cdot \log{(E)}

  • "MBIC" — Modified Bayesian information criterion (Wang et al., 2018): \log{\large(\frac{D}{n - E}\large)} + \large(\frac{\log{(n)} \cdot E}{n}\large) \cdot \log{(\log{(p)}})

Term definitions:

  • n — sample size

  • p — number of variables

  • E — edges

  • S — empirical correlation matrix

  • K — estimated inverse covariance matrix (network)

  • L = \frac{n}{2} \cdot \log \text{det} K - \sum_{i=1}^p (SK)_{ii}

  • D = n \cdot \sum_{i=1}^p (SK)_{ii} - \log \text{det} K

Defaults to "BIC"

ebic_gamma

Numeric (length = 1) Value to set gamma parameter in EBIC (see above). Defaults to 0.50

Only used if ic = "EBIC"

fast

Boolean (length = 1). Whether the glassoFast version should be used to estimate the GLASSO. Defaults to TRUE.

The fast results may differ by less than floating point of the original GLASSO implemented by glasso and should not impact reproducibility much (set to FALSE if concerned)

LLA

Boolean (length = 1). Should Local Linear Approximation be used to find optimal minimum? Defaults to FALSE or a single-pass approximation, which can be significantly faster (Zou & Li, 2008). Set to TRUE to find global minimum based on convergence (LLA_threshold)

LLA_threshold

Numeric (length = 1). When performing the Local Linear Approximation, the maximum threshold until convergence is met. Defaults to 1e-03

LLA_iter

Numeric (length = 1). Maximum number of iterations to perform to reach convergence. Defaults to 10000

network_only

Boolean (length = 1). Whether the network only should be output. Defaults to TRUE. Set to FALSE to obtain all output for the network estimation method

verbose

Boolean (length = 1). Whether messages and (insignificant) warnings should be output. Defaults to FALSE (silent calls). Set to TRUE to see all messages and warnings for every function call

...

Additional arguments to be passed on to auto.correlate

Value

When network_only = TRUE (default), returns a p \times p numeric matrix of partial correlations representing the estimated Gaussian graphical model. Off-diagonal entry [i,j] is the partial correlation between variables i and j controlling for all other variables, with values in [-1, 1]; a value of zero indicates the absence of an edge. Diagonal entries are zero. Row and column names are inherited from data.

When network_only = FALSE, returns a named list with the following elements:

network

The p \times p partial correlation matrix described above

K

The p \times p estimated inverse covariance (precision) matrix at the optimal lambda. Diagonal entries are the conditional precisions; off-diagonal entries are proportional to partial covariances

R

The p \times p regularized covariance matrix returned by GLASSO at the optimal lambda (the w component of the GLASSO solution)

penalty

Character string naming the penalty function used (one of "atan", "exp", "gumbel", "log", "weibull")

lambda

Numeric scalar giving the regularization parameter value selected by the information criterion. Larger values correspond to sparser networks

gamma

Numeric scalar giving the shape parameter of the penalty actually used. For adaptive penalties (adaptive = TRUE), this is the data-derived value; otherwise it is the default or user-supplied value

correlation

The p \times p empirical correlation matrix computed from data, used as input to the GLASSO

criterion

Character string naming the information criterion used for model selection (e.g., "bic")

IC

Numeric scalar giving the value of the information criterion at the optimal lambda

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>

References

Log penalty
Candes, E. J., Wakin, M. B., & Boyd, S. P. (2008). Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14(5), 877–905.

BIC0
Dicker, L., Huang, B., & Lin, X. (2013). Variable selection and estimation with the seamless-L0 penalty. Statistica Sinica, 23(2), 929–962.

Local Linear Approximation
Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.

EXP penalty
Wang, Y., Fan, Q., & Zhu, L. (2018). Variable selection and estimation using a continuous approximation to the L0 penalty. Annals of the Institute of Statistical Mathematics, 70(1), 191–214.

Atan penalty
Wang, Y., & Zhu, L. (2016). Variable selection and parameter estimation with the Atan regularization method. Journal of Probability and Statistics, 2016, 1–12.

Seminal simulation in network psychometrics
Williams, D. R. (2020). Beyond lasso: A survey of nonconvex regularization in Gaussian graphical models. PsyArXiv.

One-step Local Linear Approximation
Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4), 1509–1533.

Examples

# Obtain default estimator (adaptive Weibull)
weibull_network <- network_estimation(
  data = basic_smallworld, LLA = TRUE
)

# Obtain Atan network
atan_network <- network_estimation(
  data = basic_smallworld, penalty = "atan", LLA = TRUE
)

# Obtain static EXP network
exp_network <- network_estimation(
  data = basic_smallworld, penalty = "exp",
  adaptive = FALSE, LLA = TRUE
)


Traditional Fit Metrics for Networks

Description

Computes several traditional fit metrics for networks including

Usage

network_fit(network, n, S, ci = 0.95)

Arguments

network

Matrix or data frame. A p by p square network matrix

n

Numeric (length = 1). Sample size

S

Matrix or data frame. A p by p square zero-order correlation matrix corresponding with the input network

ci

Numeric (length = 1). Confidence interval for RMSEA. Defaults to 0.95

Value

A named numeric vector of traditional and likelihood-based fit indices. The vector always contains the following elements:

chisq

Chi-square statistic (\chi^2 = n \cdot F_{ML}), where F_{ML} is the maximum likelihood discrepancy function between the model-implied and empirical correlation matrices

df

Degrees of freedom: total number of unique off-diagonal correlations minus the number of non-zero edges in network

chisq.p.value

p-value for the chi-square test of exact fit (H0: model-implied covariance equals the population covariance)

RMSEA

Root mean square error of approximation. Values \leq 0.05 indicate close fit; values \leq 0.08 indicate acceptable fit

RMSEA.XX.lower, RMSEA.XX.upper

Lower and upper bounds of the ci-level confidence interval for RMSEA, where XX is the integer percentage (e.g., RMSEA.95.lower and RMSEA.95.upper for a 95% CI)

RMSEA.p.value

p-value for the one-sided test of close fit (H0: RMSEA \leq 0.05)

CFI

Comparative fit index, comparing the target model to an independence (null) baseline. Values \geq 0.95 indicate acceptable fit

TLI

Tucker-Lewis index (non-normed fit index). Values \geq 0.95 indicate acceptable fit; can fall outside [0, 1] for severely misspecified models

SRMR

Standardized root mean residual: the root mean squared difference between the model-implied and observed correlation matrices. Values \leq 0.08 indicate acceptable fit

logLik

Gaussian log-likelihood of the model-implied correlation matrix, assuming zero mean structure (means are not estimated)

AIC

Akaike's information criterion: -2 \cdot \log L + 2 \cdot E, where E is the number of non-zero edges in network

BIC

Bayesian information criterion: -2 \cdot \log L + E \cdot \log(n)

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>

References

Epskamp, S., Rhemtulla, M., & Borsboom, D. (2017). Generalized network psychometrics: Combining network and latent variable models. Psychometrika, 82(4), 904–927.

Examples

# Obtain correlation matrix
S <- auto_correlate(basic_smallworld)

# Obtain Weibull network
weibull_network <- network_estimation(data = basic_smallworld, LLA = TRUE)

# Obtain fit (expects continuous variables!)
network_fit(network = weibull_network, n = nrow(basic_smallworld), S = S)
# Scaled metrics are not yet available for
# dichotomous or polytomous data!


Computes Polychoric Correlations

Description

A fast implementation of polychoric correlations in C. Uses the Beasley-Springer-Moro algorithm (Boro & Springer, 1977; Moro, 1995) to estimate the inverse univariate normal CDF, the Drezner-Wesolosky approximation (Drezner & Wesolosky, 1990) to estimate the bivariate normal CDF, and Brent's method (Brent, 2013) for optimization of rho

Usage

polychoric_matrix(
  data,
  na_data = c("pairwise", "listwise"),
  empty_method = c("none", "zero", "all"),
  empty_value = c("none", "point_five", "one_over"),
  ...
)

Arguments

data

Matrix or data frame. A dataset with all ordinal values (rows = cases, columns = variables). Data are required to be between 0 and 11. Proper adjustments should be made prior to analysis (e.g., scales from -3 to 3 in increments of 1 should be shifted by added 4 to all values)

na_data

Character (length = 1). How should missing data be handled? Defaults to "pairwise". Available options:

  • "pairwise" — Computes correlation for all available cases between two variables

  • "listwise" — Computes correlation for all complete cases in the dataset

empty_method

Character (length = 1). Method for empty cell correction. Available options:

  • "none" — Adds no value (empty_value = "none") to the empirical joint frequency table between two variables

  • "zero" — Adds empty_value to the cells with zero in the joint frequency table between two variables

  • "all" — Adds empty_value to all in the joint frequency table between two variables

empty_value

Character (length = 1). Value to add to the joint frequency table cells. Accepts numeric values between 0 and 1 or specific methods:

  • "none" — Adds no value (0) to the empirical joint frequency table between two variables

  • "point_five" — Adds 0.5 to the cells defined by empty_method

  • "one_over" — Adds 1 / n where n equals the number of cells based on empty_method. For empty_method = "zero", n equals the number of zero cells

...

Not used but made available for easier argument passing

Value

A symmetric numeric matrix of dimension p \times p, where p is the number of variables (columns) in data. Each off-diagonal entry [i,j] is the polychoric correlation between variables i and j — the estimated Pearson correlation between the latent continuous variables assumed to underlie the observed ordinal categories — with values in [-1, 1]. Diagonal entries are 1. Row and column names are inherited from data. If any variable has zero variance, its corresponding row and column are set to NA and a warning is issued.

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com> with assistance from GPT-4

References

Beasley-Moro-Springer algorithm
Beasley, J. D., & Springer, S. G. (1977). Algorithm AS 111: The percentage points of the normal distribution. Journal of the Royal Statistical Society. Series C (Applied Statistics), 26(1), 118-121.

Moro, B. (1995). The full monte. Risk 8 (February), 57-58.

Brent optimization
Brent, R. P. (2013). Algorithms for minimization without derivatives. Mineola, NY: Dover Publications, Inc.

Drezner-Wesolowsky bivariate normal approximation
Drezner, Z., & Wesolowsky, G. O. (1990). On the computation of the bivariate normal integral. Journal of Statistical Computation and Simulation, 35(1-2), 101-107.

Examples

# Categorize data
basic_categorized <- apply(
  basic_smallworld, 2, categorize, categories = 5
)

# Compute polychoric correlation matrix
correlations <- polychoric_matrix(basic_categorized)

# Randomly assign missing data
basic_categorized[sample(1:length(basic_categorized), 1000)] <- NA

# Compute polychoric correlation matrix with pairwise missing
pairwise_correlations <- polychoric_matrix(
  basic_categorized, na_data = "pairwise"
)

# Compute polychoric correlation matrix with listwise missing
pairwise_correlations <- polychoric_matrix(
  basic_categorized, na_data = "listwise"
)


Construct a Near-Degree-Preserving Ring Lattice

Description

Converts a partial correlation network matrix into a ring lattice whose per-node degree sequence approximates the original as closely as possible. An adjacency matrix is derived automatically from network (non-zero entries become edges), so the function accepts any weighted or binary network directly. Nodes are internally reordered by degree before the lattice is built. Starting from a ring lattice dense enough to cover the maximum target degree, the algorithm runs two independent pruning passes — one visiting nodes from highest to lowest surplus and one from lowest to highest — and retains whichever pass produces the smaller total residual surplus (ties broken by the higher average clustering coefficient). As a final fallback, if the empirical network itself has a higher clustering coefficient than either pruned lattice, the empirical adjacency is returned instead. The function is designed for use as the lattice baseline in small-worldness calculations.

Usage

ring2lattice(network)

Arguments

network

Matrix. A square, symmetric numeric matrix representing a network (e.g., partial correlations). Non-zero off-diagonal entries are treated as edges; the binary adjacency is derived internally as network != 0. Isolated nodes (degree zero) are supported and are disconnected from the lattice before pruning begins.

Details

## Algorithm

Initialisation. The binary adjacency A = [\mathbf{1}(\text{network}_{ij} \neq 0)] is constructed internally. Nodes are reordered in decreasing degree before all subsequent steps. The target degree of node i (after reordering) is d_i = \sum_j A_{ij}. A ring lattice is seeded by connecting node i to node j whenever the ring distance between them does not exceed \lceil (d_i + d_j) / 2 \rceil, so every node begins with at least as many edges as its target degree. Nodes with d_i = 0 are immediately disconnected.

Circular distance. A distance matrix recording the ring distance between every pair of positions is precomputed and maintained throughout:

\delta_{ij} = \min\!\bigl(|i - j|,\; p - |i - j|\bigr)

Surplus degree. The surplus of node i is e_i = \hat{d}_i - d_i \geq 0, where \hat{d}_i is the current ring degree. Pruning continues while \sum_i e_i > 0 or until p \times 2 iterations have elapsed.

Dual-pass pruning. Two independent pruning passes are run on the same seed lattice:

Within each pass, for an over-degree node i, candidate removal partners are current neighbours that also carry surplus (e_j > 0). The partner to remove is selected by the following lexicographic priority:

  1. Clustering constraint — prefer edges whose removal reduces local clustering least, scored as 1 / (c_{ij} + 1), where c_{ij} is the number of common neighbours of i and j.

  2. Surplus — among equally ranked candidates, prefer the neighbour with the largest surplus e_j.

  3. Ring distance — as a final tiebreaker, prefer the neighbour furthest away on the ring (\delta_{ij}), keeping the lattice as local as possible.

Pass selection. After both passes complete, the one with the smaller total residual surplus \sum_i e_i is retained. If the two passes are tied on total surplus, the one with the higher average clustering coefficient is chosen.

Empirical fallback. If the average clustering coefficient of the winning lattice is still lower than that of the original empirical network, the empirical adjacency (reordered by degree) is returned as the lattice baseline instead.

Termination. Each pass exits when all surpluses reach zero or p \times 2 outer iterations have been completed without further change.

Value

A square symmetric binary adjacency matrix of the same dimension as network representing the resulting ring lattice. The average clustering coefficient of the returned lattice is attached as the attribute "CC" and can be retrieved with attr(result, "CC").

The degree of each node in the returned lattice is less than or equal to the corresponding degree in network; exact equality is achieved wherever the pruning algorithm can satisfy it within nodes * 2 iterations per pass.

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>

References

Ring lattice and small-world networks
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440–442. doi:10.1038/30918

Examples

# Get network
network <- network_estimation(basic_smallworld)

# Construct ring lattice
L <- ring2lattice(network)

# Retrieve the attached clustering coefficient
attr(L, "CC")

# Degree sequences should be identical (or very close)
cbind(target = colSums(network != 0), achieved = colSums(L))


Simulates Stochastic Block Model Data

Description

Simulates data from a Gaussian Graphical Model (GGM) with a stochastic block model (SBM) structure. Nodes are partitioned into communities, with edge density controlled separately within and between communities via a blocks x blocks density matrix. Absolute edge weights are drawn from a Weibull distribution whose parameters are predicted from the network size, sample size, and signal-to-noise ratio using a Seemingly Unrelated Regression (SUR) model fitted to 194 empirical psychometric networks (Huth et al., 2025). The resulting population partial correlation matrix is used to generate multivariate normal (or skewed) data.

A diffusion parameter controls the minimum proportion of the strongest edges that are reserved for within-community positions. Because the remaining edges are shuffled randomly, the effective within-community advantage will generally exceed 1 - diffusion; see Details.

Parameters do not have default values (except negative_proportion, diffusion, target_condition, max_correlation, and max_iterations) and must each be set. See Details and Examples to get started.

Usage

simulate_sbm(
  nodes,
  blocks,
  density_matrix,
  snr = 1,
  diffusion = 0.3,
  diffusion_range = NULL,
  negative_proportion,
  sample_size,
  skew = 0,
  skew_range = NULL,
  target_condition = 30,
  max_correlation = 0.8,
  max_iterations = 100
)

Arguments

nodes

Numeric (length = 1 or blocks). Number of nodes per community block. Can be a single value applied to all blocks, or a vector of length blocks specifying each block's size individually. Minimum of three nodes per block. The total number of nodes (sum(nodes)) should be between 8 and 54 to remain within the range of the empirical networks used to fit the Weibull parameter model; values outside this range are accepted but will trigger extrapolation and may produce a warning from weibull_parameters.

blocks

Numeric (length = 1). Number of community blocks.

density_matrix

Matrix (dim = blocks x blocks). A symmetric numeric matrix specifying edge probabilities within and between community blocks. Diagonal entry [i,i] gives the within-block edge probability for community i; off-diagonal entry [i,j] (i \neq j) gives the between-block edge probability for the pair of communities i and j. All entries must be in [0, 1] and the matrix must be symmetric (i.e., density_matrix[i,j] == density_matrix[j,i]). See Details for construction guidance.

snr

Numeric (length = 1). Signal-to-noise ratio of the absolute partial correlation weights, defined as \bar{|w|} / \mathrm{SD}(|w|). This ratio governs the shape and scale of the Weibull distribution used to generate edge weights: values below 1 produce wider, more heterogeneous weight distributions; values above 1 produce narrower, more homogeneous distributions. Note that the same SNR can arise from different combinations of mean and standard deviation (e.g., mean = 0.05, SD = 0.05 and mean = 0.15, SD = 0.15 both give SNR = 1), so SNR alone does not determine the absolute magnitude of edge weights, which is additionally governed by nodes and sample_size via weibull_parameters. Empirically observed SNR values ranged from 0.648 to 1.712; values outside this range are accepted but will trigger a warning. Defaults to 1.

diffusion

Numeric (length = 1). Controls the minimum proportion of the top-ranked edges (by absolute weight) that are guaranteed to be placed within communities rather than distributed freely across the network. Specifically, 1 - diffusion of the within-community edges are reserved for the highest-weight draws; the remaining edges (both within- and between-community) are filled from a randomly shuffled pool. Because shuffled edges can land within communities by chance, the actual proportion of top edges that end up within communities will on average exceed 1 - diffusion. Lower values of diffusion (e.g., 0.10) therefore produce stronger community contrast than higher values (e.g., 0.90), but diffusion should be interpreted as a floor on within-community weight concentration, not an exact control. Defaults to 0.30. Must be between 0 and 1.

diffusion_range

Numeric (length = 2). If provided, overrides diffusion by drawing the diffusion proportion uniformly from this interval on each call. Useful for introducing replication-to-replication variability in community contrast. For example, diffusion_range = c(0.05, 0.20) samples a value between 5% and 20% on each draw. The same floor interpretation applies as for diffusion. Both values must be between 0 and 1.

negative_proportion

Numeric (length = 1). Proportion of between-community edges assigned a negative sign (inhibitory connections). Each between-community edge is independently signed negative with this probability (i.e., a Bernoulli draw per edge), so the realized proportion will vary around the specified value. Must be between 0 and 1. Applies only to between-community edges; all within-community edges are positive. If not provided, a value is sampled from a truncated normal distribution reflecting the empirical distribution of true negative partial correlations across 194 psychometric networks: mean = 0.34, SD = 0.086, bounded to [0.083, 0.55].

sample_size

Numeric (length = 1). Number of observations to generate from the population multivariate distribution. Also influences the predicted Weibull scale parameter via weibull_parameters: larger samples are associated with smaller, more precisely estimated edge weights.

skew

Numeric (length = 1 or sum(nodes)). Skew applied to each variable after generation from the multivariate normal. Can be a single value applied to all variables, or one value per variable. Values are rounded to the nearest 0.05 increment and must be in [-2, 2]. Defaults to 0 (no skew).

skew_range

Numeric (length = 2). If provided, overrides skew by drawing a skew value independently and uniformly from this interval for each variable. Both values must be in [-2, 2].

target_condition

Numeric (length = 1). Target condition number (computed via kappa with exact = TRUE) used when ridge regularization is required to recover a positive definite precision matrix. The smallest ridge penalty \lambda that brings the condition number to this target is found via root-finding (uniroot), subject to a maximum shrinkage of approximately 23% following Peeters et al. (2020). After conditioning, the Weibull bounds are re-verified on the updated edge weights; draws that fall outside the empirical bounds after conditioning are rejected. Lower values produce better-conditioned (more stable) matrices. Defaults to 30. Values up to 100 are accepted but not recommended.

max_correlation

Numeric (length = 1). Maximum allowed absolute pairwise correlation in the population correlation matrix R. Any draw where max(abs(R[lower.tri(R)])) > max_correlation is rejected and a new attempt is made. Must be between 0 and 1. Defaults to 0.80.

max_iterations

Numeric (length = 1). Maximum number of attempts to find a connected network with valid edge weights before stopping with an error. The error message reports a frequency table of rejection reasons to assist with diagnosing convergence failures. Defaults to 100.

Details

Constructing density_matrix

The density_matrix is a blocks x blocks symmetric matrix where entry [i,i] gives the within-block edge probability for community i, and entry [i,j] (for i \neq j) gives the between-block edge probability for communities i and j. The simplest construction uses a uniform off-diagonal density with block-specific diagonals:

# Uniform within (0.90) and between (0.20) density for 3 blocks
dm <- matrix(0.20, nrow = 3, ncol = 3)
diag(dm) <- 0.90

For asymmetric community structure, each diagonal entry can differ:

# Varying within-block density per community
dm <- matrix(0.20, nrow = 3, ncol = 3)
diag(dm) <- c(0.85, 0.90, 0.95)

For full pairwise control over between-block densities, specify the complete symmetric matrix directly:

dm <- matrix(c(
  0.90, 0.20, 0.10,
  0.20, 0.85, 0.30,
  0.10, 0.30, 0.95
), nrow = 3, ncol = 3)

Diffusion and within-community weight concentration

The diffusion parameter does not exactly fix the proportion of top-ranked edges placed within communities. Instead, 1 - diffusion of the within-community edge slots are filled deterministically from the highest-weight draws. The remaining edge slots (within- and between-community alike) are filled from a randomly shuffled pool of lower-ranked weights, meaning some additional high-weight edges will land within communities by chance. The realized within-community weight advantage will therefore always be at least as large as implied by 1 - diffusion, and typically larger. The Q field in the returned parameters list (Newman-Girvan modularity) provides a post-hoc summary of the actual community contrast achieved.

Value

A named list with four elements:

data

Numeric matrix of dimension sample_size x sum(nodes) containing the simulated observations drawn from the population GGM. Rows are cases; columns are variables named V01, V02, etc. Values are continuous (or skewed continuous when skew != 0). To produce ordinal data, pass the columns through categorize.

parameters

A list of input, derived, and estimated parameters:

  • nodes — Integer vector of length blocks giving the number of nodes per block (scalar input is expanded to this length)

  • blocks — Number of community blocks

  • sample_size — Number of simulated observations

  • skew — Named numeric vector of per-variable skew values actually applied (after rounding and possible resampling)

  • density_matrix — The blocks x blocks density matrix as supplied

  • negative_proportion — The proportion of between-community edges assigned a negative sign, either as supplied or as sampled from the empirical distribution

  • weibull — Named numeric vector of length 2 giving the Weibull shape and scale parameters of the absolute edge weight distribution actually used. If ridge conditioning was applied, these are re-estimated from the conditioned network via MLE.

  • diffusion — Numeric vector of length 2 giving the within-block reservation range as a proportion of within-community edges, on the internal scale used by the sampler (i.e., range(1 - diffusion) or range(1 - diffusion_range)). Both values are equal when diffusion is scalar. Note this is the complement of the user-supplied diffusion value and represents the fraction of within-community slots filled from the top-ranked draws.

  • Q — Newman-Girvan modularity of the population network (Omega) with respect to the block membership, computed via igraph::modularity on absolute edge weights. Provides a summary of the community contrast actually achieved after weight assignment and any ridge conditioning.

population

Population-level network parameters:

  • R — Population correlation matrix derived from the GGM via pcor2cor

  • Omega — Population partial correlation matrix (the GGM edge weight matrix), with zeros for absent edges

  • membership — Named integer vector of length sum(nodes) giving the community block assignment (1 to blocks) for each node

convergence

Iteration and conditioning diagnostics:

  • iterations — Number of sampling attempts needed to find a valid network (including graph structure and edge weight draws)

  • rejections — Character vector of length max_iterations + 1 recording the rejection reason for each failed attempt; entries for successful or unused iterations are empty strings. Common reasons include disconnected graph structure, condition number exceeding target_condition, maximum correlation exceeding max_correlation, and Weibull parameters falling outside empirical bounds after ridge conditioning.

  • lambda — Ridge regularization parameter \lambda added to the diagonal of the precision matrix to ensure positive definiteness; NA if no conditioning was required

  • condition — Condition number of the final population correlation matrix R, computed via kappa with exact = TRUE

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>

References

Seminal introduction to Stochastic Block Models
Holland, P. W., Laskey, K. B., & Leinhardt, S. (1983). Stochastic blockmodels: First steps. Social Networks, 5(2), 109–137.

Empirical network data used to fit the Weibull SUR model
Huth, K. B. S., Haslbeck, J. M. B., Keetelaar, S., Van Holst, R. J., & Marsman, M. (2025). Statistical evidence in psychological networks. Nature Human Behaviour.

Maximum ridge shrinkage bound
Peeters, C. F., van de Wiel, M. A., & van Wieringen, W. N. (2020). The spectral condition number plot for regularization parameter evaluation. Computational Statistics, 35(2), 629–646.

Examples

# Construct density matrix for 3 blocks with uniform densities
dm <- matrix(0.20, nrow = 3, ncol = 3)
diag(dm) <- 0.90

# Basic 3-block simulation with equal-sized communities
result <- simulate_sbm(
  nodes = 6, # 6 nodes per block = 18 total
  blocks = 3,
  sample_size = 500,
  density_matrix = dm
)

# Unequal block sizes
result <- simulate_sbm(
  nodes = c(4, 6, 8), # 18 total nodes
  blocks = 3,
  sample_size = 500,
  density_matrix = dm
)

# Varying within-block density per community
dm_varying <- matrix(0.20, nrow = 3, ncol = 3)
diag(dm_varying) <- c(0.85, 0.90, 0.95)

result <- simulate_sbm(
  nodes = 6,
  blocks = 3,
  sample_size = 500,
  density_matrix = dm_varying
)

# Full pairwise between-block density control
dm_pairwise <- matrix(c(
  0.90, 0.20, 0.10,
  0.20, 0.85, 0.30,
  0.10, 0.30, 0.95
), nrow = 3, ncol = 3)

result <- simulate_sbm(
  nodes = 6,
  blocks = 3,
  sample_size = 500,
  density_matrix = dm_pairwise
)

# Fix the proportion of negative between-community edges
result <- simulate_sbm(
  nodes = 6,
  blocks = 3,
  sample_size = 500,
  density_matrix = dm,
  negative_proportion = 0.20
)

# Introduce variability in diffusion across replications
result <- simulate_sbm(
  nodes = 6,
  blocks = 3,
  sample_size = 500,
  density_matrix = dm,
  diffusion_range = c(0.30, 0.70)
)


Simulates Small-World GGM Data

Description

Simulates data from a Gaussian Graphical Model (GGM) with a small-world network structure. The generative process proceeds in three stages. First, a ring lattice is constructed with neighbors + 1 nearest-neighbor connections per node, where neighbors is derived from nodes and density. Second, the lattice is randomly pruned to the target density, introducing degree heterogeneity from the outset. Third, edges are rewired with probability rewire, where rewired edges are placed preferentially on node pairs with higher combined degree (degree- weighted rewiring). This approach produces more realistic degree distributions than standard Watts-Strogatz rewiring while preserving the local clustering structure of the lattice. Edge weights are assigned by structural priority: edges retained from the pruned lattice receive larger partial correlation weights based on their original neighbor distance, grounded in the empirical observation that shorter-distance connections tend to carry stronger weights in psychometric networks. The resulting network is used to generate multivariate normal (or skewed) data. Parameters do not have default values (except negative_proportion, snr, target_condition, max_correlation, and max_iterations) and must each be set. See Details and Examples to get started.

Usage

simulate_smallworld(
  nodes,
  density,
  rewire,
  snr = 1,
  negative_proportion,
  sample_size,
  skew = 0,
  skew_range = NULL,
  target_condition = 30,
  max_correlation = 0.8,
  max_iterations = 100
)

Arguments

nodes

Numeric (length = 1). Number of nodes in the network. Minimum of three nodes. The total number of nodes should be between 8 and 54 to remain within the range of the empirical networks used to fit the Weibull parameter model; values outside this range are accepted but will trigger extrapolation and may produce a warning from weibull_parameters.

density

Numeric (length = 1). Target edge density of the network after pruning the initial ring lattice. Controls the number of nearest neighbors k each node is connected to via k = \mathrm{round}((\text{nodes} \times \frac{\text{nodes}-1}{2} \times \text{density}) / \text{nodes}), subject to a minimum of 2 and a maximum of \lfloor (\text{nodes}-1)/2 \rfloor. A ring lattice with neighbors + 1 connections per node is first generated and then pruned to this density, introducing degree heterogeneity before rewiring. Must be between 0 and 1. A minimum density sufficient to maintain a connected graph is enforced; values below this threshold will produce an informative error.

rewire

Numeric (length = 1). Probability of rewiring each edge. Unlike the standard Watts-Strogatz model, rewired edges are placed using degree-weighted random selection: node pairs with higher combined degree have a greater probability of receiving a rewired edge (see Details). Values near 0 preserve the pruned lattice structure; values near 1 produce approximately random networks. The small-world regime typically occurs at intermediate values (roughly 0.01 to 0.30). Must be between 0 and 1.

snr

Numeric (length = 1). Signal-to-noise ratio of the absolute partial correlation weights, defined as \bar{|w|} / \mathrm{SD}(|w|). Values less than 1 produce wider, more heterogeneous weight distributions; values greater than 1 produce narrower, more homogeneous distributions. Empirically observed SNR values ranged from 0.648 to 1.712; values outside this range are accepted but will trigger a warning. Defaults to 1.

negative_proportion

Numeric (length = 1). Proportion of edges that are negative (inhibitory). Must be between 0 and 0.50. The upper bound of 0.50 is a mathematical constraint of the sign-flipping procedure used to assign negative edges (see Details). If not provided, a value is sampled from a truncated normal distribution reflecting the empirical distribution of true negative partial correlations across 194 psychometric networks: mean = 0.34, SD = 0.086, bounded to [0.083, 0.50].

sample_size

Numeric (length = 1). Number of observations to generate from the population multivariate normal distribution. Also influences the predicted Weibull scale parameter via weibull_parameters: larger samples are associated with smaller, more precisely estimated edge weights.

skew

Numeric (length = 1 or nodes). Skew applied to each variable after generation from the multivariate normal. Can be a single value (applied to all variables) or one value per variable. Values are rounded to the nearest increment of 0.05 in the range [-2, 2]. Defaults to 0 (no skew).

skew_range

Numeric (length = 2). If provided, a skew value is drawn uniformly from this range for each variable, overriding skew. Both values must be between -2 and 2.

target_condition

Numeric (length = 1). Target condition number (using kappa with exact = TRUE) applied when ridge regularization is needed to recover a positive definite precision matrix. The smallest ridge penalty \lambda that brings the condition number to this target is found via root-finding (uniroot), subject to a maximum shrinkage of approximately 23% following Peeters et al. (2020). Lower values produce better-conditioned matrices. Defaults to 30. Values up to 100 are accepted but not recommended.

max_correlation

Numeric (length = 1). Maximum allowed absolute pairwise correlation in the population correlation matrix R. Any draw where max(abs(R[lower.tri(R)])) > max_correlation is rejected and a new attempt is made. Must be between 0 and 1. Defaults to 0.80.

max_iterations

Numeric (length = 1). Maximum number of attempts to find (1) a connected network structure, (2) a network satisfying the small-world screening criterion, and (3) a valid set of edge weights. The error message reports a frequency table of rejection reasons to assist with diagnosing convergence failures. Defaults to 100.

Details

Lattice generation and pruning

The generative process begins by constructing a ring lattice with neighbors + 1 nearest-neighbor connections per node via sample_smallworld with p = 0. The +1 overshoot ensures the lattice always has more edges than the target density, guaranteeing that pruning can proceed. The lattice is then randomly pruned to the target density by removing edges uniformly at random, subject to the constraint that the pruned graph remains connected. Because removal is random, different nodes lose different numbers of edges, producing a heterogeneous degree distribution before any rewiring occurs. This heterogeneity provides a non-uniform prior for the subsequent degree-weighted rewiring step.

Degree-weighted rewiring

Each edge in the pruned lattice is independently selected for rewiring with probability rewire. For each selected edge (i, j), node i is kept fixed and the j endpoint is redirected to a new target node. Valid targets are restricted to node pairs involving node i that (1) are not currently connected, and (2) have never been occupied during the current rewiring pass (i.e., were absent in the original pruned lattice). Among valid targets, the new endpoint is selected with probability proportional to \sqrt{d_i + d_k}, where d_i and d_k are the current degrees of the two nodes in the candidate pair. The square-root transformation moderates the rich-get-richer tendency of linear preferential attachment, producing degree heterogeneity consistent with the empirical range of psychometric networks without generating extreme hubs. Node degrees are updated incrementally after each rewire so that subsequent rewiring steps reflect the current graph state.

Edge weight assignment

Edge weights are assigned by ranking edges according to their distance in the pruned lattice before rewiring. Edges retained from the pruned lattice receive their original neighbor distance (1 = nearest neighbor, 2 = second nearest, etc.); rewired edges are assigned a distance of neighbors + 1, placing them at the bottom of the priority order. Absolute edge weights are drawn from a Weibull distribution whose parameters are predicted from the network size, sample size, and signal- to-noise ratio using a Seemingly Unrelated Regression (SUR) model fitted to 194 empirical psychometric networks (Huth et al., 2025). The sorted (descending) Weibull weights are then mapped onto the distance ranking so that the largest weights go to the shortest-distance edges. This assignment is grounded in the empirical observation that shorter-distance local connections tend to carry larger partial correlation weights in psychometric networks.

Negative edge assignment

Negative edges are introduced by flipping the signs of a subset of nodes — multiplying all edges incident to those nodes by -1. The number of nodes to flip is derived from negative_proportion via the inversion formula (1 - \sqrt{1 - 2p}) / 2, where p is the target proportion of negative edges. This formula assumes that an edge is negative if and only if exactly one of its endpoints is flipped, giving an expected negative proportion of 2 \cdot (k/n) \cdot (1 - k/n) for k flipped nodes out of n total. The formula requires p \leq 0.50, which is why negative_proportion is bounded at 0.50.

Small-worldness screening

Each generated adjacency structure is screened using the omega statistic (Telesford et al., 2011): \omega = (L_r / L) - (C / C_l), where L is the average shortest path length (ASPL) of the network, L_r is the expected ASPL of a random graph with the same degree sequence, C is the average clustering coefficient, and C_l is the clustering coefficient of the pruned lattice (before rewiring). Networks with |\omega| > 0.80 are rejected. The random-graph ASPL L_r is computed analytically via Equation 54 of Newman, Strogatz, and Watts (2001): L_r = \ln(N / z_1) / \ln(z_2 / z_1) + 1, where z_1 is the mean degree and z_2 = \overline{k^2} - \overline{k} is the mean number of second neighbors, when the conditions N > z_1 and z_2 > z_1 are satisfied. When these conditions are not met, a simulation-based estimate using iter = 100 degree-sequence-matched random graphs is used as a fallback. Note that the lattice clustering reference C_l is computed from the pruned lattice rather than from an idealized regular ring, which is internally consistent with the data generating process.

Value

A named list with four elements:

data

Numeric matrix of dimension sample_size x nodes containing the simulated observations drawn from the population GGM. Rows are cases; columns are variables named V01, V02, etc. Values are continuous (or skewed continuous when skew != 0). To produce ordinal data, pass the columns through categorize.

parameters

A list of input, derived, and estimated parameters:

  • nodes — Number of nodes (as supplied)

  • density — Target edge density (as supplied)

  • neighbors — Number of nearest neighbors k derived from nodes and density; the initial ring lattice uses neighbors + 1 connections before pruning

  • rewire — Edge rewiring probability (as supplied)

  • negative_proportion — Proportion of negative edges, either as supplied or as sampled from the empirical distribution

  • sample_size — Number of simulated observations

  • skew — Named numeric vector of per-variable skew values actually applied (after rounding and possible resampling)

  • weibull — Named numeric vector of length 2 giving the Weibull shape and scale parameters of the absolute edge weight distribution actually used. If ridge conditioning was applied, these are re-estimated from the conditioned network via MLE.

  • omega — Smallworldness omega statistic of the generated network (Telesford et al., 2011). Values near zero indicate small-world structure; negative values indicate lattice-like structure; positive values indicate random-like structure

  • Q — Newman-Girvan modularity of the population network (omega) computed via igraph::modularity on absolute edge weights, using the community partition that maximizes modularity (igraph::cluster_optimal). Because cluster_optimal finds the exact modularity maximum, Q represents an upper bound on recoverable community structure in the network rather than the result of a heuristic partition. Values near zero indicate absence of community structure, consistent with the network theory of psychopathology. Note that cluster_optimal is computationally intensive for large networks; runtime increases substantially beyond 50 nodes

population

Population-level network parameters:

  • R — Population correlation matrix derived from the GGM via pcor2cor

  • omega — Population partial correlation matrix (the GGM edge weight matrix), with zeros for absent edges

convergence

Iteration and conditioning diagnostics:

  • iterations — Number of sampling attempts needed to find a valid network

  • rejections — Character vector recording the rejection reason for each failed attempt. Common reasons include disconnected graph structure, omega exceeding the small-world threshold, condition number exceeding target_condition, and maximum correlation exceeding max_correlation

  • lambda — Ridge regularization parameter \lambda added to the diagonal of the precision matrix to ensure positive definiteness; NA if no conditioning was required

  • condition — Condition number of the final population correlation matrix R

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>

References

Seminal introduction to the Watts-Strogatz small-world model
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature, 393(6684), 440–442.

Logic for weight assignments
Muldoon, S. F., Bridgeford, E. W., & Bassett, D. S. (2016). Small-world propensity and weighted brain networks. Scientific Reports, 6(1), 22057.

Analytical approximation of random-graph average path length
Newman, M. E. J., Strogatz, S. H., & Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Physical Review E, 64(2), 026118.

Empirical network data used to fit the Weibull SUR model
Huth, K. B. S., Haslbeck, J. M. B., Keetelaar, S., Van Holst, R. J., & Marsman, M. (2025). Statistical evidence in psychological networks. Nature Human Behaviour.

Maximum ridge shrinkage bound
Peeters, C. F., van de Wiel, M. A., & van Wieringen, W. N. (2020). The spectral condition number plot for regularization parameter evaluation. Computational Statistics, 35(2), 629–646.

Examples

# Basic small-world network (moderate density, moderate rewiring)
result <- simulate_smallworld(
  nodes = 20,
  density = 0.30,
  rewire = 0.20,
  sample_size = 500
)

# Lattice-like structure (low rewiring preserves local connectivity)
result <- simulate_smallworld(
  nodes = 20,
  density = 0.30,
  rewire = 0.01,
  sample_size = 500
)

# Random-like structure (high rewiring destroys lattice regularity)
result <- simulate_smallworld(
  nodes = 20,
  density = 0.30,
  rewire = 0.80,
  sample_size = 500
)

# Fix the proportion of negative edges
result <- simulate_smallworld(
  nodes = 20,
  density = 0.30,
  rewire = 0.20,
  sample_size = 500,
  negative_proportion = 0.20
)


Skew Tables

Description

Tables for skew based on the number of categories (2, 3, 4, 5, or 6) in the data

Usage

data(skew_tables)

Format

A list (length = 5)

Examples

data("skew_tables")


Predict Weibull Parameters for Edge Weight Distributions

Description

Predicts the shape and scale parameters of a Weibull distribution that characterizes the absolute partial correlation edge weights of a psychometric network, given its number of nodes and sample size. Parameter estimates are derived from a Seemingly Unrelated Regression (SUR) model fitted to empirical network data from Huth et al. (2025), where absolute partial correlations were found to follow a Weibull distribution more consistently than Beta, Gamma, or log-normal alternatives.

Usage

weibull_parameters(nodes, sample_size, snr = 1, bootstrap = FALSE)

Arguments

nodes

Integer. The number of nodes (variables) in the network. Must be between 8 and 54, reflecting the range of the empirical networks used to fit the underlying SUR model.

sample_size

Integer. The sample size of the dataset from which the network is estimated.

snr

Numeric (length = 1). Signal-to-noise ratio of partial correlations (\bar{|w|} / \mathrm{SD}(|w|)). Values less than 1 indicate wider range of partial correlations (w) whereas values greater than 1 indicate narrower range. Defaults to 1 where the mean of the partial correlations (\bar{|w|}) is equal to the standard deviation (\mathrm{SD}(|w|))

bootstrap

Logical. If TRUE, a randomly sampled residual from the SUR model fit is added to each predicted parameter, introducing empirically grounded variability suitable for use in simulation or bootstrapping contexts. Defaults to FALSE.

Details

The shape and scale parameters are predicted from derived network descriptors that differ between the two SUR equations:

rlp

The reciprocal log of the number of nodes, 1 / \log(p), capturing the diminishing marginal effect of network size on edge weight distributions. Used in both the shape and scale equations.

scaling

The standard error of partial correlations defined as \sqrt{1 / (n - p - 2)}, where n is the sample size and p is the number of nodes. Larger values indicate greater sampling uncertainty in the partial correlation estimates. Used in the scale equation only.

The two SUR equations have an asymmetric structure reflecting different theoretical roles for sampling precision. Shape — which governs the concentration of the edge weight distribution — is determined solely by signal characteristics of the network via snr and rlp. Scale — which governs the typical magnitude of edge weights — additionally depends on scaling, as the expected size of partial correlations is directly affected by estimation precision. This asymmetry is both empirically supported (dropping scaling from the shape equation costs \Delta R^2 < 0.006) and theoretically coherent.

These predictors enter two SUR equations whose coefficients are stored in the internal weibull_weights dataset. SUR was used to account for the correlated residuals between the shape and scale equations across networks (residual correlation = 0.261). Model fit was strong: the shape equation achieved R^2 = 0.887 (RMSE = 0.048) and the scale equation achieved R^2 = 0.885 (RMSE = 0.011). Shape residuals were normally distributed (Shapiro-Wilk W = 0.990, p = 0.173). Scale residuals showed a modest departure from normality (Shapiro-Wilk W = 0.973, p < 0.001), consistent with slight right skew in the scale outcome and test sensitivity at n = 194 rather than a substantive violation. Heteroskedasticity was detected in both the shape equation (Breusch-Pagan p < 0.001) and the scale equation (Breusch-Pagan p < 0.001); robust standard errors (HC3) were used for inference. Multicollinearity among predictors in the scale equation was negligible (VIF \leq 1.60); the shape equation contains only two predictors with no multicollinearity concern.

nodes influences predicted shape and scale via rlp. sample_size influences predicted scale via scaling, and both parameters via their joint contribution to snr when it is estimated from data rather than supplied directly.

Empirically, shape values ranged from approximately 0.72 to 1.63 (M = 1.07, SD = 0.14) and scale values from approximately 0.03 to 0.19 (M = 0.10, SD = 0.03) across the 194 networks used to fit the model. Shape values near 1 indicate approximately exponential edge weight distributions; values above 1 indicate a rising hazard (mode-bearing distribution).

When bootstrap = TRUE, residuals are drawn via shuffle() – a random sampling without replacement – from the empirical SUR residuals, preserving the observed marginal residual distribution.

Value

A named numeric vector of length 2:

shape

The predicted Weibull shape parameter (k > 0).

scale

The predicted Weibull scale parameter (\lambda > 0).

Author(s)

Alexander P. Christensen <alexpaulchristensen@gmail.com>

References

Huth, K. B. S., Haslbeck, J. M. B., Keetelaar, S., Van Holst, R. J., & Marsman, M. (2025). Statistical evidence in psychological networks. Nature Human Behaviour.

Examples

# Predict parameters for a 10-node network with n = 500
weibull_parameters(nodes = 10, sample_size = 500)

# With bootstrapped residuals for use in simulation
weibull_parameters(nodes = 10, sample_size = 500, bootstrap = TRUE)


SUR Model Coefficients and Residuals for Weibull Parameter Prediction

Description

A named list encoding the Seemingly Unrelated Regression (SUR) model fitted to Weibull shape and scale parameters derived from 194 empirical networks. The object is consumed internally by weibull_parameters to generate data-driven Weibull parameter estimates given a network's number of nodes and sample size.

Usage

data(weibull_weights)

Format

A named list with two elements, shape and scale, each containing:

coefficients

A named numeric vector of regression coefficients from the SUR model. The equations are asymmetric: the shape equation includes an intercept and two predictors (snr, rlp); the scale equation includes an intercept and three predictors (snr, rlp, scaling). Predictors are defined as follows:

snr

Signal-to-noise ratio of the absolute partial correlations, computed as the mean divided by the standard deviation of the absolute edge weights. Used in both equations.

rlp

Reciprocal log of node count, 1 / \log(p), where p is the number of nodes. Used in both equations.

scaling

Standard error of partial correlations, \sqrt{1 / (n - p - 2)}, where n is the sample size and p is the number of nodes. Used in the scale equation only.

residuals

A numeric vector of residuals from the fitted SUR equation, used by weibull_parameters to introduce empirically grounded variability when bootstrap = TRUE.

Details

Absolute partial correlations from 222 deduplicated empirical networks (Huth et al., 2025) were fitted to Beta, Gamma, log-normal, and Weibull distributions via maximum likelihood. Weibull provided the best fit most consistently: it outperformed each alternative by more than 2 log-likelihood units far more often than the reverse (vs. Beta: 56–0; vs. Gamma: 15–2; vs. log-normal: 155–13; vs. Exponential: 54–0).

The resulting Weibull shape and scale parameters were then jointly modelled as a function of network descriptors using Seemingly Unrelated Regression (systemfit), which accounts for correlated residuals between the shape and scale equations across networks (residual correlation = 0.261). Prior to fitting, networks with fewer than eight nodes (p < 8), more than 300,000 observations, or fewer than one observation per edge (\text{ope} \leq 1) were excluded, as Huth et al. (2025) demonstrated that networks in this regime show the weakest statistical evidence for edge presence or absence, yielding unstable parameter estimates (n = 28 excluded). This left n = 194 networks for analysis. Shape and scale parameters were modelled on their original scales.

The two equations have an asymmetric structure. Shape — which governs the concentration of the edge weight distribution — is predicted from snr and rlp only, reflecting that the shape of the distribution is a property of the network's signal structure independent of sampling precision. Scale — which governs typical edge weight magnitude — additionally includes scaling, as the expected size of partial correlations is directly affected by estimation precision. Dropping scaling from the shape equation costs \Delta R^2 < 0.006 and yields cleaner inference; all shape predictors are significant under HC3-robust standard errors (both p < 0.001).

Variance inflation factors for the scale equation (\leq 1.60) confirmed the absence of problematic multicollinearity; the shape equation contains only two predictors with no multicollinearity concern. Breusch-Pagan tests indicated statistically significant heteroskedasticity in both equations; however, all predictors remained significant under HC3-robust standard errors, indicating no material effect on inference. Shape residuals were normally distributed (Shapiro-Wilk W = 0.990, p = 0.173). Scale residuals showed a modest departure from normality (Shapiro-Wilk W = 0.973, p < 0.001), consistent with slight right skew in the scale outcome and test sensitivity at n = 194 rather than a substantive violation, as confirmed by visual inspection of the residual histogram. Model fit was strong: shape R^2 = 0.887 (RMSE = 0.048); scale R^2 = 0.885 (RMSE = 0.011).

References

Huth, K. B. S., Haslbeck, J. M. B., Keetelaar, S., Van Holst, R. J., & Marsman, M. (2025). Statistical evidence in psychological networks. Nature Human Behaviour.

Examples

data("weibull_weights")

# Inspect SUR coefficients for each equation
weibull_weights$shape$coefficients
weibull_weights$scale$coefficients

# Predict Weibull parameters for a new network
weibull_parameters(nodes = 12, sample_size = 500)