| Title: | Fast Implementations of Kernel K-Means |
| Version: | 0.1.2 |
| Description: | Implementations several algorithms for kernel k-means. The default 'OTQT' algorithm is a fast alternative to standard implementations of kernel k-means, particularly in cases with many clusters. For a small number of clusters, the implemented 'MacQueen' method typically performs the fastest. For more details and performance evaluations, see Berlinski and Maitra (2025) <doi:10.1002/sam.70032>. |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| LinkingTo: | Rcpp |
| Imports: | Rcpp |
| NeedsCompilation: | yes |
| Packaged: | 2026-01-27 15:52:56 UTC; josh |
| Author: | Josh Berlinski [aut, cre] |
| Maintainer: | Josh Berlinski <jdberlinski@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-01-31 19:10:03 UTC |
Classify new data based on kkmeans result
Description
Given training data, test data, and kkmeans-results, get which partitions a new set of data belong to
Usage
cluster_new(test_data, train_data, train_result)
Arguments
test_data |
the new data to be classified |
train_data |
the data to make classifications from |
train_result |
|
Value
A vector of class labels for test_data corresponding to the
clusters present in train_result.
Get the kernel matrix for a dataset
Description
Given a dataset, kernel function, and tuning parameter, will return the n x n kernel matrix
Usage
get_kernel_matrix(data, kern = "g", param1 = 1, param2 = 1)
Arguments
data |
data vector |
kern |
the kernel to use, one of ('gaussian', 'poly', 'sigmoid', 'laplacian'), can use first letter |
param1 |
first parameter to pass to kernel function. |
param2 |
second parameter to pass to kernel function. |
Value
An n x n matrix for data given by the specified kernel. The value
in position (i, j) corresponds to the kernel function evaluated at data[i, ]
and data[j, ].
Get the average distance to each points k-nearest neighbor
Description
Given a dataset and a value k, will return the value of the average distance from each point to it's k-nearest neighbor
Usage
get_mknn_dist(data, k = FALSE)
Arguments
data |
the data vector |
k |
which neighbor to average over |
Value
The average distance from each point to it's k-th nearest neighbor.
Function to get jump statistic for varying values of k
Description
Obtains the jump statistic for a particular kernel for the specified number of clusters
Usage
jump_stat(data, kern = "g", param = 1, k_max, eta, iter_max = 1000L)
Arguments
data |
Numeric data to cluster. This will be converted to a matrix using |
kern |
The kernel to use. |
param |
The parameter value to pass to the kernel |
k_max |
The maximum number of clusters to consider |
eta |
Power for the jump statistic |
iter_max |
Maximum number of iterations to use in |
Value
Sum of squares and value of jump statistic for 1, ..., K chosen clusters
An Efficient Kernel K-Means Algorithm
Description
Performs kernel k-means with the specified kernel using an optimal-transfer quick-transfer algorithm.
Usage
kkmeans(
data,
k,
kern = "g",
param = 1,
param2 = 1,
nstart = 10,
iter_max = 1000L,
estimate = FALSE,
nn = 0,
init_centers = sample(1:k, size = nrow(data), replace = TRUE),
method = c("otqt", "macqueen", "lloyd", "ot"),
trueest = FALSE,
kmat = NULL,
random_centers = TRUE
)
Arguments
data |
Numeric data to cluster. This will be converted to a matrix using |
k |
Number of clusters. |
kern |
Kernel to use, one of ('gaussian', 'poly', 'sigmoid', 'laplacian'). |
param |
value of parameter to pass to kernel function.(eg sigma in
gaussian kernel). The Gaussian kernel is K(x, y) = exp(- ||x - y||^2 / (2* |
param2 |
value of second parameter parameter to pass to the kernel function, which correspond to the offset for the sigmoid and polynomial kernels. |
nstart |
Number of times to run the algorithm. The run with the lowest total within cluster SSE (in feature space) will be returned |
iter_max |
The maximum number of iterations to allow. |
estimate |
If using the Gaussian kernel, specifying |
nn |
How many neighbors to consider for mknn estimation. |
init_centers |
The initial values for cluster membership. If |
method |
Which method to use for kernel k-means iteration. One of ("otqt", "macqueen", "lloyd"). "otqt" is a method using optimal-transfer and quick-transfer heuristics similar to the Hartigan and Wong algorithm for k-means clustering. |
trueest |
Whether or not the within-cluster sum of squares should be recomputed in R after clustering is finished |
kmat |
kernel matrix, if using a custom kernel |
random_centers |
if TRUE, then assign |
Value
A list containing the following useful information
- cluster
The final cluster membership.
- centers
A k x p matrix, the rows of which contain the centers of the clusters in R^n (not to be confused with the clusters in feature space)
- wss
The within-cluster sum of squares for each cluster in feature space.
- param
The parameter value used.
Examples
data <- as.matrix(iris[, 1:4])
# cluster using linear kernel (normal k-means)
result <- kkmeans(data, k = 3, kern = "poly", param = 1)
# cluster using gaussian kernel
# estimating the parameter with 3-nearest neighbors
result <- kkmeans(data, k = 3, kern = "g", estimate = "mknn", nn = 3)
Estimate the bandwidth parameter for a gaussian kernel using MATr
Description
given data and number of clusters K, choose a bandwidth from a grid according to the MATR algorithm
Usage
matr(dat, k, grid, tol = 0.01)
Arguments
dat |
data vector |
k |
number of clusters |
grid |
parameter grid to search on |
tol |
tolerence for choosing the "best" sigma values. Reducing the tolerance will give larger values of the bandwidth parameter |
Value
A list with two elements. The first, l, contains the trace value for
each of the values in grid. The second, best, contains the best value.