lakhesis: Consensus Seriation for Binary Data

The R package lakhesis provides an interactive platform and critical measures for seriating binary data matrices through the exploration, selection, and consensus of partially seriated sequences.

Seriation (sequencing, ordination) involves putting a set of things in an optimal order. In archaeology, seriation can be used to establish a chronological order of contexts and find-types on the basis of their similarity, i.e., on the premise that things come into and go out of fashion with a peak moment of popularity (Ihm 2005). In ecology, the distribution of a species may follow a preferred environmental condition, with its occurrence diminishing as that environment changes (ter Braak and Looman 1986).

A number of R functions and packages, especially seriation (Hahsler, Hornik, and Buchta 2008) and vegan (Oksanen et al. 2024), provide the means to seriate or ordinate matrices, especially for frequency or count data. While binary (presence/absence) data are often viewed as a reductive case of frequency data, they can also present their own challenges. Moreover, not all incidence matrices (matrices of 0/1 values that record the joint incidence or occurrence of each row-column pairing) will be well seriated. The selection of which row and column elements to include in the input is accordingly an intrinsic part of the task of seriation.

In this respect, lakhesis seeks to complement existing methods in R, focusing on binary data, by providing an interactive, graphical means of selecting seriated sequences. It relies on correspondence analysis (CA), a mainstay technique for seriation, and offers a method of Procrustes-fit CA to align scores with an ideal reference curve. Multiple seriations can be rerun on partial subsets, called “strands,” of the initial incidence matrix, which are then recompiled into a single consensus seriation using an optimality criterion. The process of harmonizing different strands of sequential elements via iterative linear regression is called a lakhesis technique, after the fate from ancient Greek mythology who measured the strand of one’s life. The package relies on Rcpp and RcppArmadillo (Eddelbuettel and Sanderson 2014; Eddelbuettel and Balamuta 2018).
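To make the role of correspondence analysis concrete, the following is a minimal sketch of plain CA seriation in base R. It is not the Procrustes-fit method used by lakhesis and does not call any lakhesis function; the toy matrix and all variable names are assumptions made for illustration. Rows and columns are simply reordered by their scores on the first CA axis.

```r
# Illustrative only: plain CA seriation in base R, not the lakhesis method.
# A perfectly seriated toy incidence matrix (hypothetical data) ...
ideal <- matrix(c(1, 1, 0, 0,
                  0, 1, 1, 0,
                  0, 0, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("R", 1:3), paste0("C", 1:4)))

# ... with its rows and columns shuffled to serve as input
x <- ideal[c(2, 3, 1), c(3, 1, 4, 2)]

# Correspondence matrix and row/column masses
P <- x / sum(x)
row_mass <- rowSums(P)
col_mass <- colSums(P)

# Standardized residuals and their singular value decomposition
S <- diag(1 / sqrt(row_mass)) %*%
  (P - outer(row_mass, col_mass)) %*%
  diag(1 / sqrt(col_mass))
dec <- svd(S)

# Standard coordinates on the first CA axis
row_score <- dec$u[, 1] / sqrt(row_mass)
col_score <- dec$v[, 1] / sqrt(col_mass)

# Reorder rows and columns by their first-axis scores
seriated <- x[order(row_score), order(col_score)]
print(seriated)
```

Because the sign of a singular vector is arbitrary, the recovered order may come out reversed (R3 to R1 and C4 to C1); both directions describe the same sequence.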

While command line functions can be run in R, the functionality of lakhesis is primarily achieved via the Lakhesis Calculator, a graphical interface in shiny (Chang et al. 2024) that enables investigators to explore datasets, select strands, and harmonize them into a single consensus seriation. Panels in the calculator include:

The sidebar contains the following commands:

Installation

To install the current development version of lakhesis from GitHub, run the following in the R console:

# requires the devtools package: install.packages("devtools")
library(devtools)
install_github("scollinselliott/lakhesis", dependencies = TRUE, build_vignettes = TRUE)

Usage

To start the Lakhesis Calculator, execute the function LC():

library(lakhesis)
LC()

When uploading a CSV file for analysis in the Lakhesis Calculator, the incidence matrix should be in “long” format. That is, the file should consist of just two columns without headers, in which each line records one incidence (a 1 in the matrix) as a row-column pair. For example, an incidence matrix of

\[\begin{array}{cccc} \, & C_1 & C_2 & C_3 \\ R_1 & 1 & 0 & 0 \\ R_2 & 0 & 1 & 1 \\ R_3 & 0 & 0 & 1 \end{array}\]

will have a corresponding long format of

R1, C1
R2, C2 
R2, C3 
R3, C3
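The mapping from matrix to long format can be sketched in base R as follows. This is an illustration of the format only, using the example matrix above; the object names are assumptions and the code does not use or reproduce any lakhesis function.

```r
# Illustrative only: derive the long format from the example matrix in base R
x <- matrix(c(1, 0, 0,
              0, 1, 1,
              0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("R1", "R2", "R3"), c("C1", "C2", "C3")))

# Row/column index pairs of every cell equal to 1
idx <- which(x == 1, arr.ind = TRUE)

# One line per incidence: row name, then column name
long <- data.frame(row = rownames(x)[idx[, "row"]],
                   col = colnames(x)[idx[, "col"]],
                   stringsAsFactors = FALSE)

# Sort by row name, then by column name
long <- long[order(long$row, long$col), ]
print(long)
```

Each of the four 1s in the matrix yields exactly one line, so the number of lines in the long format equals the number of incidences.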

If characters are not displaying properly in the plot, check the character encoding of the file (UTF-8 is recommended).

Row and column elements must be unique (a row element cannot have the same name as a column element).
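A quick way to check this requirement before uploading a file is sketched below in base R. The helper name_clashes() is hypothetical (it is not part of lakhesis), and the two small inline tables stand in for the contents of a real file.

```r
# Hypothetical helper (not part of lakhesis): return the names that are used
# both as a row element and as a column element in a two-column long table
name_clashes <- function(long) {
  stopifnot(ncol(long) == 2)
  intersect(unique(long[[1]]), unique(long[[2]]))
}

# A valid long-format table, read the same way a headerless CSV would be
long <- read.csv(text = "R1, C1\nR2, C2\nR2, C3\nR3, C3",
                 header = FALSE, strip.white = TRUE,
                 stringsAsFactors = FALSE)

# An invalid table: "B" is used as both a row name and a column name
clashing <- read.csv(text = "A, B\nB, C",
                     header = FALSE, strip.white = TRUE,
                     stringsAsFactors = FALSE)

print(name_clashes(long))      # no shared names
print(name_clashes(clashing))  # the shared name
```

For a file on disk, the text argument would be replaced by the file path; renaming the shared elements (for instance with a row or column prefix) resolves any clash the check reports.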

The Lakhesis Calculator enables the temporary suppression of row or column elements from the plots, with any rows or columns left containing only zeros removed automatically. Suppressing key elements can therefore produce unexpected results. All suppressed elements can easily be re-added and the starting incidence matrix re-initialized.
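The knock-on effect of suppression can be sketched in base R. The function suppress() below is a hypothetical illustration, not the calculator's own code: it drops the named elements and then removes any rows or columns that are left with only zeros.

```r
# Illustrative only: a hypothetical suppress() in base R, not lakhesis code
suppress <- function(x, elements) {
  # Drop the suppressed row and column elements
  x <- x[!(rownames(x) %in% elements),
         !(colnames(x) %in% elements),
         drop = FALSE]
  # Remove rows and columns that now contain only zeros
  x[rowSums(x) > 0, colSums(x) > 0, drop = FALSE]
}

x <- matrix(c(1, 0, 0,
              0, 1, 1,
              0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("R1", "R2", "R3"), c("C1", "C2", "C3")))

# Suppressing C1 leaves R1 with no incidences, so R1 is removed as well
y <- suppress(x, "C1")
print(y)
```

Here a single suppressed column removes a row that was never selected, which is the kind of side effect to watch for when key elements are suppressed.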

Incidence Matrices

If data are already in incidence matrix format, the im_long() function in lakhesis converts the matrix to the necessary long format, which can then be exported with write.table() (see the documentation on im_long()):

# x is a matrix of 0/1 values with unique row/column names
y <- im_long(x)
# write two columns only, without a header line or row names
write.table(y, file = "im.csv", sep = ",", row.names = FALSE, col.names = FALSE)

The file im.csv can then be loaded into the Lakhesis Calculator.

Consensus Seriations

A consensus seriation via a lakhesis technique can be established in the calculator. It can also be performed in the console: given a set of seriations, whether derived by Procrustes-fit CA or by another method, create a strands object and then execute the lakhesize() function.

For example, using the built-in selection of three strands in the data object qf_strands:

x <- lakhesize(qf_strands)
summary(x)

The vignette “A Guide to Lakhesis” contains more information on usage.

References

Chang, W., J. Cheng, J. J. Allaire, C. Sievert, B. Schloerke, Y. Xie, J. Allen, J. McPherson, A. Dipert, and B. Borges. 2024. Shiny: Web Application Framework for R. https://shiny.posit.co.
Eddelbuettel, D., and J. J. Balamuta. 2018. “Extending R with C++: A Brief Introduction to Rcpp.” The American Statistician 72: 28–36. https://doi.org/10.1080/00031305.2017.1375990.
Eddelbuettel, D., and C. Sanderson. 2014. “RcppArmadillo: Accelerating R with High-Performance C++ Linear Algebra.” Computational Statistics and Data Analysis 71: 1054–63. https://doi.org/10.1016/j.csda.2013.02.005.
Hahsler, M., K. Hornik, and C. Buchta. 2008. “Getting Things in Order: An Introduction to the R Package Seriation.” Journal of Statistical Software 25: 1–34. https://doi.org/10.18637/jss.v025.i03.
Ihm, P. 2005. “A Contribution to the History of Seriation in Archaeology.” In Classification – The Ubiquitous Challenge, edited by C. Weihs and W. Gaul, 307–16. Berlin: Springer.
Oksanen, J., G. L. Simpson, F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin, R. B. O’Hara, et al. 2024. “Vegan: Community Ecology Package.” https://doi.org/10.32614/CRAN.package.vegan.
ter Braak, C. J. F., and C. W. N. Looman. 1986. “Weighted Averaging, Logistic Regression and the Gaussian Response Model.” Vegetatio 65: 3–11. https://doi.org/10.1007/BF00032121.