The Introduction to synthACS briefly mentions the split and combine_smsm functionality in Sections 3.2 and 3.4 respectively. There, we note that deriving the sample synthetic micro data is a memory-intensive process and advise using synthACS on a high-performance machine. Of course, such a machine is not always available, which is when split and combine_smsm are needed.
A brief illustration of these two functions is provided in this vignette. The same example data is used as in the introductory vignette:
library(data.table)
library(acs)
library(synthACS)
library(retry)
ca_geo <- geo.make(state = "CA", county = "*")
ca_dat_SMSM <- pull_synth_data(2014, 5, ca_geo)
split() and combine_smsm()
The split and combine_smsm functions are used, respectively, to reduce the computational requirements of a large spatial microsimulation task into a set of smaller tasks and to recombine the results. They enable the well-known “split-apply-combine” strategy for data analysis (Wickham 2011). In this case, the “apply” step is intentionally performed sequentially and not inside another function in order to minimize RAM usage and enable a garbage-collection step between intensive in-memory function calls.
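The pattern described above can be sketched in base R on a toy data frame. This is only an illustration of split-apply-combine with a sequential "apply" step, not synthACS code: `process_chunk` and the toy data are hypothetical stand-ins for an expensive per-chunk computation.

```r
# Generic split-apply-combine sketch in base R (not synthACS code).
# `process_chunk` is a hypothetical stand-in for an expensive step.
process_chunk <- function(chunk) {
  data.frame(group = chunk$group[1], total = sum(chunk$value))
}

# toy data: four groups of five rows each
dat <- data.frame(group = rep(1:4, each = 5), value = 1:20)

# split: one smaller data frame per group
chunks <- split(dat, dat$group)

# apply: a plain sequential loop, so only one chunk's intermediate
# objects need to be held in memory at a time
results <- vector("list", length = length(chunks))
for (i in seq_along(chunks)) {
  results[[i]] <- process_chunk(chunks[[i]])
  gc()  # request garbage collection before the next chunk
}

# combine: bind the per-chunk results back into one object
combined <- do.call(rbind, results)
```

In synthACS, split and combine_smsm play the roles of the first and last steps of this pattern.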
The syntax for both is straightforward:
split(<object>, n_splits= N)
combine_smsm(<object1>, <object2>, ..., <objectk>)
split takes a larger macroACS class object and splits it into n_splits smaller macroACS objects. Similarly, combine_smsm takes several smaller smsm_set objects and combines them into a single, larger smsm_set class object.
An example of this is provided below:
# split()
n_splits <- 20
split_ca_dat <- split(ca_dat_SMSM, n_splits = n_splits)
tmp_opts <- vector("list", length = n_splits)

for (i in 1:n_splits) {
  # Section 3.3 of introduction: SMSM via simulated annealing
  # derive synthetic datasets
  tmp_synth <- derive_synth_datasets(split_ca_dat[[i]], leave_cores = 0)

  # create constraints for simulated annealing
  a <- all_geog_constraint_age(tmp_synth, method = "macro.table")
  g <- all_geog_constraint_gender(tmp_synth, method = "macro.table")
  m <- all_geog_constraint_marital_status(tmp_synth, method = "macro.table")
  r <- all_geog_constraint_race(tmp_synth, method = "synthetic")
  e <- all_geog_constraint_edu(tmp_synth, method = "synthetic")

  cll <- all_geogs_add_constraint(attr_name = "age", attr_total_list = a,
                                  macro_micro = tmp_synth)
  cll <- all_geogs_add_constraint(attr_name = "gender", attr_total_list = g,
                                  macro_micro = tmp_synth, constraint_list_list = cll)
  cll <- all_geogs_add_constraint(attr_name = "marital_status", attr_total_list = m,
                                  macro_micro = tmp_synth, constraint_list_list = cll)
  cll <- all_geogs_add_constraint(attr_name = "race", attr_total_list = r,
                                  macro_micro = tmp_synth, constraint_list_list = cll)
  cll <- all_geogs_add_constraint(attr_name = "edu_attain", attr_total_list = e,
                                  macro_micro = tmp_synth, constraint_list_list = cll)

  # anneal
  tmp_opts[[i]] <- all_geog_optimize_microdata(tmp_synth, seed = 6550L, verbose = TRUE,
                                               constraint_list_list = cll, p_accept = 0.4,
                                               max_iter = 10000L)
}
# create the string needed for combine_smsm().
paste0("tmp_opts[[", 1:n_splits, "]]", sep= ", ", collapse= "")
# [1] "tmp_opts[[1]], tmp_opts[[2]], tmp_opts[[3]], tmp_opts[[4]], tmp_opts[[5]],
# tmp_opts[[6]], tmp_opts[[7]], tmp_opts[[8]], tmp_opts[[9]], tmp_opts[[10]],
# tmp_opts[[11]], tmp_opts[[12]], tmp_opts[[13]], tmp_opts[[14]], tmp_opts[[15]],
# tmp_opts[[16]], tmp_opts[[17]], tmp_opts[[18]], tmp_opts[[19]], tmp_opts[[20]], "
# copy and paste the resulting string, excluding the final trailing comma
opt_ca <- combine_smsm(tmp_opts[[1]], tmp_opts[[2]], tmp_opts[[3]], tmp_opts[[4]], tmp_opts[[5]],
                       tmp_opts[[6]], tmp_opts[[7]], tmp_opts[[8]], tmp_opts[[9]], tmp_opts[[10]],
                       tmp_opts[[11]], tmp_opts[[12]], tmp_opts[[13]], tmp_opts[[14]],
                       tmp_opts[[15]], tmp_opts[[16]], tmp_opts[[17]], tmp_opts[[18]],
                       tmp_opts[[19]], tmp_opts[[20]])
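As a possible alternative to building and pasting the argument string, base R's do.call() can pass every element of a list as a separate argument to a function that collects its inputs through `...`. The sketch below shows the mechanism with a hypothetical stand-in, `combine_demo`; it does not call combine_smsm() itself, and the toy inputs are made up for illustration.

```r
# Stand-in variadic combiner; `combine_demo` is hypothetical and only mimics
# a function that, like combine_smsm(), takes its inputs through `...`.
combine_demo <- function(...) {
  pieces <- list(...)
  # flatten one level: a list of lists becomes a single list
  unlist(pieces, recursive = FALSE)
}

# three per-chunk results, each a list keyed by a made-up name
parts <- list(list(a = 1), list(b = 2), list(c = 3))

# do.call() passes each list element as a separate argument, equivalent to
# combine_demo(parts[[1]], parts[[2]], parts[[3]])
all_parts <- do.call(combine_demo, parts)
names(all_parts)
```

Assuming combine_smsm() treats its arguments the same way, `opt_ca <- do.call(combine_smsm, tmp_opts)` might replace the copy-and-paste step; this has not been verified against the package here.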