This vignette explains the usage of the ipf()
function,
which has been used for calibrating the labour force survey of Austria
for several years. It is based on the Iterative Proportional Fitting
algorithm and gives some flexibility about the details of the
implementation. See (Meraner, Gumprecht, and
Kowarik 2016) or vignette("methodology")
for more
details.
We will assume the output of demo.eusilc()
is our
population. From this population, a sample without replacement is drawn.
The sample covers 10 percent of the population. We assign a weight of
one for all observations of the population and a weight of ten for all
observations of the sample.
library(surveysd)
<- demo.eusilc(1, prettyNames = TRUE)
population := 1]
population[, pWeight <- population[sample(1:.N, floor(.N*0.10)), ]
pop_sample := 10] pop_sample[, pWeight
We will start with an example where we want to adapt the weights of
pop_sample
such that the weighted number of males and
females matches the ones of population
. We can see that
this is currently not the case.
<- xtabs(pWeight ~ gender, population))
(gender_distribution #> gender
#> male female
#> 7267 7560
xtabs(pWeight ~ gender, pop_sample)
#> gender
#> male female
#> 7280 7540
Due to random sampling (rather than stratified sampling), there are
differences between the gender distributions. We can pass
gender_distribution
as a parameter to ipf()
to
obtain modified weights.
<- ipf(pop_sample, conP = list(gender_distribution), w = "pWeight") pop_sample_c
The resulting dataset, pop_sample_c
is similar to
pop_sample
but has an additional column with the adjusted
weights.
dim(pop_sample)
#> [1] 1482 30
dim(pop_sample_c)
#> [1] 1482 31
setdiff(names(pop_sample_c), names(pop_sample))
#> [1] "calibWeight"
We can now calculate the weighted number of males and females according to this new weight. This will result in a match for the constraints.
xtabs(calibWeight ~ gender, pop_sample_c)
#> gender
#> male female
#> 7267 7560
xtabs(pWeight ~ gender, population)
#> gender
#> male female
#> 7267 7560
In this simple case, ipf
just performs a post
stratification step. This means, that all males and all females have the
same weight.
xtabs(~ calibWeight + gender, pop_sample_c)
#> gender
#> calibWeight male female
#> 9.98214285714286 728 0
#> 10.026525198939 0 754
All males have been weighted down (calibWeight < 10
)
to compensate for the overrepresentation in the sample.
Let’s now assume that we want to put constraints on the number of
males and females for each age group. The numbers from the original
population can be obtained with xtabs()
.
<- xtabs(pWeight ~ gender + age, population))
(con_ga #> age
#> gender (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#> male 1528 855 2165 1822 897
#> female 1375 848 2255 1845 1237
xtabs(pWeight ~ gender + age, pop_sample)
#> age
#> gender (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#> male 1300 850 2280 1840 1010
#> female 1210 970 2350 1830 1180
Again, we can see that those constraints are not met. Supplying the
contingency table con_ga
to ipf()
will again
resolve this.
<- ipf(pop_sample, conP = list(con_ga), w = "pWeight")
pop_sample_c2 xtabs(pWeight ~ gender + age, population)
#> age
#> gender (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#> male 1528 855 2165 1822 897
#> female 1375 848 2255 1845 1237
xtabs(calibWeight ~ gender + age, pop_sample_c2)
#> age
#> gender (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#> male 1528 855 2165 1822 897
#> female 1375 848 2255 1845 1237
Now we assume that we know the number of persons living in each nuts2 region from registry data.
<- xtabs(pWeight ~ region, population) registry_table
However, those registry data does not provide any information about
age or gender
. Therefore, the two contingency tables
(con_ga
and registry_table
) have to be
specified independently. This can be done by supplying a list to
conP
<- ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight")
pop_sample_c2 xtabs(pWeight ~ gender + age, population)
#> age
#> gender (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#> male 1528 855 2165 1822 897
#> female 1375 848 2255 1845 1237
xtabs(calibWeight ~ gender + age, pop_sample_c2)
#> age
#> gender (-Inf,16] (16,25] (25,45] (45,65] (65, Inf]
#> male 1528 855 2165 1822 897
#> female 1375 848 2255 1845 1237
xtabs(pWeight ~ region, population)
#> region
#> Burgenland Carinthia Lower Austria Salzburg Styria
#> 549 1078 2804 924 2295
#> Tyrol Upper Austria Vienna Vorarlberg
#> 1317 2805 2322 733
xtabs(calibWeight ~ region, pop_sample_c2)
#> region
#> Burgenland Carinthia Lower Austria Salzburg Styria
#> 549.0000 1077.9999 2804.0000 924.0002 2294.9999
#> Tyrol Upper Austria Vienna Vorarlberg
#> 1316.9996 2805.0003 2322.0001 733.0001
this time, the constraints are not matched perfectly. That is,
because we provided more than one constraint. therefore, the
ipf()
algorithm had to work iteratively.
If the dataset has a household structure, household constraints can
be passed via the parameter conH
. If this parameter is
used, it is also necessary to supply hid
, which defines the
column names that contains household ids.
<- xtabs(pWeight ~ hsize + region, data = population[!duplicated(hid)]))
(conH1 #> region
#> hsize Burgenland Carinthia Lower Austria Salzburg Styria Tyrol
#> (0,1] 58 117 325 103 264 118
#> (1,2] 82 126 345 102 260 149
#> (2,3] 37 80 189 55 187 79
#> (3,4] 33 63 169 71 122 102
#> (4,5] 14 22 82 18 49 37
#> (5,Inf] 2 17 21 12 34 11
#> region
#> hsize Upper Austria Vienna Vorarlberg
#> (0,1] 262 431 67
#> (1,2] 321 355 72
#> (2,3] 203 175 44
#> (3,4] 168 96 53
#> (4,5] 79 35 27
#> (5,Inf] 35 15 7
<- ipf(pop_sample, hid = "hid", conH = list(conH1), w = "pWeight",
pop_sample_hh bound = 10)
xtabs(calibWeight ~ hsize + region, data = pop_sample_hh[!duplicated(hid)])
#> region
#> hsize Burgenland Carinthia Lower Austria Salzburg Styria Tyrol
#> (0,1] 0 0 0 0 0 0
#> (1,2] 0 0 0 0 0 0
#> (2,3] 0 0 0 0 0 0
#> (3,4] 0 0 0 0 0 0
#> (4,5] 0 0 0 0 0 0
#> (5,Inf] 0 0 0 0 0 0
#> region
#> hsize Upper Austria Vienna Vorarlberg
#> (0,1] 0 0 0
#> (1,2] 0 0 0
#> (2,3] 0 0 0
#> (3,4] 0 0 0
#> (4,5] 0 0 0
#> (5,Inf] 0 0 0
If conP
or conH
contain several contingency
tables or if conP
and conH
are used at the
same time, the ipf algorithm will operate iteratively. This means that
the calibrated dataset will satisfy the constraints only approximately.
The default tolerances of the approximation can be overwritten using the
parameters conP
and conH
.
Lowering the tolerances will improve the match between the constraints and the contingency tables according to the calibrated weights. However, lower tolerances will also make it so more iterations are necessary until a convergence is met. If the constraints are too small, ipf will return with a warning that indicates that a convergence could not be reached.
ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight",
verbose = TRUE, epsP = 0.01)
#> Iteration stopped after 2 steps
#> Convergence reached
ipf(pop_sample, conP = list(con_ga, registry_table), w = "pWeight",
verbose = TRUE, epsP = 0.0001)
#> Iteration stopped after 4 steps
#> Convergence reached
We see that changing the tolerances from 0.01
(one
percent) to 0.0001
increases the number of required
iterations.