Use Case 04: Estimation of optimal cluster size for a trial with pre-determined buffer width

The CRTspat package provides functions for analysing this trade-off via simulations for any site for which baseline data are available. An extensive set of simulations is presented by Smith et al, 2025.

The example shown here uses the baseline prevalence data introduced in Use Case 1. The trial is assumed to plan to be based on the same outcome of prevalence, and to be powered for an efficacy of 30%. A set of different algorithmic cluster allocations are carried out with different numbers of clusters. Each allocation is randomized and buffer zones are specified with the a pre-specified width (in this example, 0.5 km). The ICC is computed from the baseline data, excluding the buffer zones, and corresponding power calculations are carried out. The power is calculated and plotted as a function of cluster size.

# use the same dataset as for Use Case 1.
library(CRTspat)
# specify the number of simulations here (a value of 10 is used for code testing, the figures are based on many more)
nsimulations <- 10
example_locations <- readdata('example_site.csv')
example_locations$base_denom <- 1
exampleCRT <- CRTsp(example_locations)
example <- aggregateCRT(exampleCRT,
      auxiliaries = c("RDT_test_result", "base_denom"))
# randomly sample an array of numbers of clusters to allocate
set.seed(5)
c_vec <- round(runif(nsimulations, min = 6, max = 60))

CRTscenario <- function(c, CRT, buffer_width) {
  ex <- specify_clusters(CRT, c = c, algo = "kmeans") %>%
    randomizeCRT() %>%
    specify_buffer(buffer_width = buffer_width)
    GEEanalysis <- CRTanalysis(ex, method = "GEE", baselineOnly = TRUE, excludeBuffer = TRUE,
                               baselineNumerator = "RDT_test_result", baselineDenominator = "base_denom")
  locations <- GEEanalysis$description$locations
  ex_power <- CRTpower(trial = ex, effect = 0.3, yC = GEEanalysis$pt_ests$controlY,
                    outcome_type = "p", N = GEEanalysis$description$sum.denominators/locations, c = c,
                    ICC = GEEanalysis$pt_ests$ICC)
  value <- c(c_full = c, c_core = ex_power$geom_core$c, clustersRequired = ex_power$geom_full$clustersRequired,
             power = ex_power$geom_full$power, mean_h = ex_power$geom_full$mean_h,
             locations = locations, ICC = GEEanalysis$pt_ests$ICC)
  names(value) <- c("c_full", "c_core", "clustersRequired", "power", "mean_h", "locations", "ICC")
  return(value)
}

results <- t(sapply(c_vec, FUN = CRTscenario, simplify = "array", CRT = example,
    buffer_width = 0.5)) %>%
    data.frame()

Each simulated cluster allocation is different, as are the randomizations. This leads to variation in the locations of the buffer zones, so the number of core clusters is a stochastic function of the number of clusters randomised (c). There is also variation in the estimated Intracluster Correlation (see Use Case 3) for any value of c.

total_locations <- example$geom_full$locations
results$proportion_included <- results$c_core * results$mean_h * 2/total_locations
results$corelocations_required <- results$clustersRequired * results$mean_h
results$totallocations_required <- with(results, total_locations/locations *
    corelocations_required)

library(ggplot2)
theme_set(theme_bw(base_size = 14))
ggplot(data = results, aes(x = c_full, y = c_core)) +
    geom_smooth() +
    xlab("Clusters allocated (per arm)") +
    ylab("Clusters in core (per arm)") +
    annotate(
     geom = "segment", x = 5, y = 18.5, xend = 35, yend = 18.5, linewidth = 2, color = 'red',
     arrow = arrow(length = unit(1, "cm"))
 )

The number of clusters in the core area increases with the number of clusters allocated, until the cluster size becomes small enough for entire clusters to be swallowed by the buffer zones. This can be illustrated by the contrast in the core areas randomised with c = 6 and c = 40 (Figures 4.2 and 4.3).

set.seed(7)
library(dplyr)
example6 <- specify_clusters(example, c = 6, algo = "kmeans") %>%
    randomizeCRT() %>%
    specify_buffer(buffer_width = 0.5)
plotCRT(example6, map = TRUE, showClusterBoundaries = TRUE, showClusterLabels = TRUE,
    labelsize = 2, maskbuffer = 0.2)
example40 <- specify_clusters(example, c = 40, algo = "kmeans") %>%
    randomizeCRT() %>%
    specify_buffer(buffer_width = 0.5)
plotCRT(example40, map = TRUE, showClusterBoundaries = TRUE, showClusterLabels = TRUE,
    labelsize = 2, maskbuffer = 0.2)

Beyond this point, increasing the number of clusters allocated in the fixed area (by making them smaller) does not add to the total number of clusters. In this example the maximum is achieved when the input c is about 35 and the output c is 18.5.

ggplot(data = results, aes(x = c_core, y = mean_h)) + geom_smooth() + xlab("Clusters in core (per arm)") +
    ylab("Mean cluster size")

The size of clusters decreases with the number allocated (Figure 4.4), but does not fall much below 10 locations on average in the example because smaller clusters are likely to be absorbed into the buffer zones.

ggplot(data = results, aes(x = c_core, y = power)) + geom_smooth() + xlab("Clusters in core (per arm)") +
    ylab("Power")

The power increases approximately linearly with the number of clusters in the core (Figure 4.5), but the site is too small for an adequate power to be achieved with this size of buffer, irrespective of the cluster size. Because the buffering leads to a maximum in the cluster density (number of clusters per unit area), so does the power achievable with a fixed area (Figure 4.6).

ggplot2::ggplot(data = results, aes(x = c_full, y = power)) + geom_smooth() +
    xlab("Clusters allocated (per arm)") + ylab("Power")

However the analysis also gives an estimate of how large an extended site is needed to achieve adequate power (assuming the the spatial pattern for the wider site to be similar to that of the baseline area). A minimum total number of locations required to achieve a pre-specified power (80%) is achieved at the same density of clusters as the maximum of the power estimated for the smaller, baseline site.

ggplot2::ggplot(data = results, aes(x = c_core, y = corelocations_required)) +
    geom_smooth() + xlab("Clusters in core (per arm)") + ylab("Required core locations")

This is also at the allocation density where saturation is achieved in the number of core clusters (Figure 4.1), and where the proportion of the locations included in the core area reaches its minimum (Figure 4.8).

ggplot2::ggplot(data = results, aes(x = c_core, y = proportion_included)) +
    geom_smooth() + xlab("Clusters in core (per arm)") + ylab("Proportion of locations in core") +
    geom_segment(aes(x = 18, xend = 18, y = 0, yend = 0.25), arrow = arrow(length = unit(1,
        "cm")), lwd = 2, color = "red")

Conclusions

With the example geography and the selected trial outcome, the most efficient trial design, conditional on a buffer width of 0.5 km, would be achieved by assigning about 30 clusters to each arm in a site of the size analysed, though about one third of these clusters would be absorbed into the buffer zones, so that there would be . This would be far from adequate to achieve adequate power. To achieve 80% power about 8,000 locations would be needed, in a larger trial area of which about 2,400 would be in the core (sampled) parts of the clusters.

A more efficient trial design might have narrower or no buffers.

ggplot2::ggplot(data = results, aes(x = c_core, y = totallocations_required)) +
    geom_smooth() + xlab("Clusters in core (per arm)") + ylab("Total locations required") +
    geom_segment(aes(x = 18, xend = 18, y = 0, yend = 8000), arrow = arrow(length = unit(1,
        "cm")), lwd = 2, color = "red")

Fig 4.9 Size of trial area required to achieve adequate power