| Type: | Package |
| Title: | Inference Based on Non-Probability Samples |
| Version: | 0.2.3 |
| Description: | Statistical inference with non-probability samples when auxiliary information from external sources, such as probability samples or population totals or means, is available. The package implements various methods, such as inverse probability (propensity score) weighting, mass imputation and the doubly robust approach. Details can be found in: Chen et al. (2020) <doi:10.1080/01621459.2019.1677241>, Yang et al. (2020) <doi:10.1111/rssb.12354>, Kim et al. (2021) <doi:10.1111/rssa.12696>, Yang et al. (2021) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00004-eng.htm> and Wu (2022) <https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00002-eng.htm>. For details on the package and its functionalities see <doi:10.48550/arXiv.2504.04255>. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.2 |
| URL: | https://github.com/ncn-foreigners/nonprobsvy, https://ncn-foreigners.ue.poznan.pl/nonprobsvy/ |
| BugReports: | https://github.com/ncn-foreigners/nonprobsvy/issues |
| Depends: | R (≥ 4.0.0), survey |
| Imports: | maxLik, stats, Matrix, MASS, ncvreg, RANN, Rcpp (≥ 1.0.12), nleqslv, doParallel, foreach, parallel, formula.tools |
| Suggests: | tinytest, covr, spelling |
| LinkingTo: | Rcpp, RcppArmadillo |
| Language: | en-US |
| NeedsCompilation: | yes |
| Packaged: | 2025-08-20 06:23:18 UTC; berenz |
| Author: | Łukasz Chrostowski [aut, ctb], Maciej Beręsewicz |
| Maintainer: | Maciej Beręsewicz <maciej.beresewicz@ue.poznan.pl> |
| Repository: | CRAN |
| Date/Publication: | 2025-08-20 07:20:02 UTC |
Admin Data (Non-Probability Survey)
Description
This is a subset of the Central Job Offers Database (CBOP), a voluntary administrative data set (non-probability sample). The data has been subject to slight manipulation, but the relationships in the data have been preserved. For more information about the CBOP, please refer to: https://oferty.praca.gov.pl/portal.
Usage
admin
Format
A single data.frame with 9,344 rows and 6 columns.
id: Identifier of an entity (company: legal or local).
private: Whether the company is a private (1) or public (0) entity.
size: The size of the entity: S – small (up to 9 employees), M – medium (10-49) or L – large (over 49).
nace: The main NACE code for a given entity: C, D.E, F, G, H, I, J, K.L, M, N, O, P, Q or R.S (14 levels, 3 combined: D and E, K and L, and R and S).
region: The region of Poland (16 levels: 02, 04, ..., 32).
single_shift: Whether an entity seeks employees on a single shift.
Examples
data("admin")
head(admin)
Checks Variable Balance Between the Probability and Non-Probability Samples
Description
The function compares the totals of the auxiliary variables specified in the x argument for an object that
contains either the IPW or the DR estimator.
Usage
check_balance(x, object, dig)
Arguments
x: a formula specifying the variables to check
object: an object of the nonprob class
dig: the number of digits for rounding (default = 2)
Value
A list containing totals for non-probability and probability samples and their differences
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
strata = ~ size + nace + region, data = jvs)
ipw_est1 <- nonprob(selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit"
)
ipw_est2 <- nonprob(
selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit",
control_selection = control_sel(est_method = "gee", gee_h_fun = 1))
## check the balance for the standard IPW
check_balance(~size+private, ipw_est1)
## check the balance for the calibrated IPW
check_balance(~size+private, ipw_est2)
## check balance for a more complicated example
check_balance(~ I(size=="M") + I(nace == "C"), ipw_est1)
Returns Coefficients of the Underlying Models
Description
Returns a list of coefficients for the selection and the outcome models
Usage
## S3 method for class 'nonprob'
coef(object, ...)
Arguments
object: a nonprob class object
...: other arguments passed to methods (currently not supported)
Value
a list with two entries:
- "coef_sel": a matrix of coefficients for the selection equation if possible, else NULL
- "coef_dr": a matrix of coefficients for the outcome equation(s) if possible, else NULL
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
strata = ~ size + nace + region, data = jvs)
ipw_est1 <- nonprob(selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit", se = FALSE
)
coef(ipw_est1)
Returns Confidence Intervals for Estimated Mean
Description
A generic function that returns the confidence interval
for the estimated mean. If standard errors have not been estimated, the function
updates the nonprob object to obtain standard errors.
Usage
## S3 method for class 'nonprob'
confint(object, parm, level = 0.95, ...)
Arguments
object: an object of the nonprob class
parm: names of parameters for which confidence intervals are to be computed; if missing, all parameters are considered
level: the confidence level for the intervals (default 0.95)
...: additional arguments
Value
returns a data.frame with confidence intervals for the target variables
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
strata = ~ size + nace + region, data = jvs)
ipw_est1 <- nonprob(selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit", se = FALSE
)
confint(ipw_est1)
Control Parameters for Inference
Description
control_inf constructs a list with all necessary control parameters
for statistical inference.
Usage
control_inf(
var_method = c("analytic", "bootstrap"),
rep_type = c("subbootstrap", "auto", "JK1", "JKn", "BRR", "bootstrap", "mrbbootstrap",
"Fay"),
vars_selection = FALSE,
vars_combine = FALSE,
bias_correction = FALSE,
num_boot = 500,
alpha = 0.05,
cores = 1,
keep_boot = TRUE,
nn_exact_se = FALSE
)
Arguments
var_method: the variance method (default "analytic"; the alternative is "bootstrap")
rep_type: the replication type for weights in the bootstrap method for variance estimation, passed to survey::as.svrepdesign()
vars_selection: whether variable selection should be conducted (default FALSE)
vars_combine: whether variables should be combined after variable selection for doubly robust estimators (default FALSE)
bias_correction: whether bias correction should be applied (default FALSE)
num_boot: the number of iterations for the bootstrap algorithms (default 500)
alpha: significance level (default 0.05)
cores: the number of cores for parallel computing (default 1)
keep_boot: a logical value indicating whether statistics from the bootstrap should be kept (default TRUE)
nn_exact_se: a logical value indicating whether to compute the exact standard error estimate for the NN and PMM estimators (default FALSE)
Value
A list with selected parameters.
See Also
nonprob() – for fitting procedure with non-probability samples.
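For instance, bootstrap variance estimation can be requested through the main nonprob() function as in the sketch below (it uses the admin and jvs data shipped with the package; num_boot is lowered from the default 500 only to keep the example fast):

```r
library(nonprobsvy)
data(admin); data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
                     strata = ~ size + nace + region, data = jvs)
## bootstrap variance with 50 replicates instead of the analytic estimator
ipw_boot <- nonprob(selection = ~ region + private + nace + size,
                    target = ~ single_shift,
                    svydesign = jvs_svy, data = admin,
                    method_selection = "logit",
                    control_inference = control_inf(var_method = "bootstrap",
                                                    num_boot = 50))
```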
Control Parameters for Outcome Model
Description
control_out constructs a list with all necessary control parameters
for outcome model.
Usage
control_out(
epsilon = 1e-08,
maxit = 100,
trace = FALSE,
k = 5,
penalty = c("SCAD", "lasso", "MCP"),
a_SCAD = 3.7,
a_MCP = 3,
lambda_min = 0.001,
nlambda = 100,
nfolds = 10,
treetype = c("kd", "rp", "ball"),
searchtype = c("standard", "priority"),
pmm_match_type = 1,
pmm_weights = c("none", "dist"),
pmm_k_choice = c("none", "min_var"),
pmm_reg_engine = c("glm", "loess"),
npar_loess = stats::loess.control(surface = "direct", trace.hat = "approximate")
)
Arguments
epsilon: tolerance for the fitting algorithms (default 1e-08)
maxit: maximum number of iterations (default 100)
trace: logical; whether to trace the steps of the fitting algorithms (default FALSE)
k: the k parameter in the RANN::nn2 function (default 5)
penalty: the penalty algorithm for variable selection (default "SCAD")
a_SCAD: the tuning parameter of the SCAD penalty for the outcome model (default 3.7)
a_MCP: the tuning parameter of the MCP penalty for the outcome model (default 3)
lambda_min: the smallest value for lambda, as a fraction of lambda.max (default 0.001)
nlambda: the number of lambda values (default 100)
nfolds: the number of folds for cross-validation in the variable selection model (default 10)
treetype: the type of tree for nearest neighbour imputation (for the NN and PMM estimators), passed to RANN::nn2
searchtype: the type of search for nearest neighbour imputation (for the NN and PMM estimators), passed to RANN::nn2
pmm_match_type: (only for the PMM estimator) indicates how to select the 'closest' unit from the non-probability sample for each unit in the probability sample: either 1 (default; \hat{y} - \hat{y} matching) or 2 (\hat{y} - y matching)
pmm_weights: (only for the PMM estimator) indicates how to weight the k nearest neighbours: "none" (default; simple average) or "dist" (weights based on the distances returned by RANN::nn2)
pmm_k_choice: (only for the PMM estimator) a character value indicating how k should be chosen: "none" (default) or "min_var" (select k by variance minimization)
pmm_reg_engine: (only for the PMM estimator) whether to use parametric ("glm") or non-parametric ("loess") regression for the predictions
npar_loess: control parameters for stats::loess via the stats::loess.control function
Value
List with selected parameters.
See Also
nonprob() – for fitting procedure with non-probability samples.
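As an illustration, a sketch of setting the PMM-specific options; the resulting list is passed to nonprob() via the control_outcome argument:

```r
library(nonprobsvy)
## 3 nearest neighbours, distance-based weighting of the matches
ctrl <- control_out(k = 3, pmm_match_type = 1, pmm_weights = "dist")
str(ctrl[c("k", "pmm_match_type", "pmm_weights")])
```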
Control Parameters for the Selection Model
Description
control_sel constructs a list with all necessary control parameters
for selection model.
Usage
control_sel(
est_method = c("mle", "gee"),
gee_h_fun = 1,
optimizer = c("maxLik", "optim"),
maxlik_method = c("NR", "BFGS", "NM"),
optim_method = c("BFGS", "Nelder-Mead"),
epsilon = 1e-04,
maxit = 500,
trace = FALSE,
penalty = c("SCAD", "lasso", "MCP"),
a_SCAD = 3.7,
a_MCP = 3,
lambda = -1,
lambda_min = 0.001,
nlambda = 50,
nfolds = 10,
print_level = 0,
start_type = c("zero", "mle", "naive"),
nleqslv_method = c("Broyden", "Newton"),
nleqslv_global = c("dbldog", "pwldog", "cline", "qline", "gline", "hook", "none"),
nleqslv_xscalm = c("fixed", "auto"),
dependence = FALSE,
key = NULL
)
Arguments
est_method: the method of estimation for the propensity score model: "mle" (default) or "gee"
gee_h_fun: the smooth function for the generalized estimating equations (GEE) method (default 1; see Details)
optimizer: (for the mle method) the optimizer: "maxLik" (default) or "optim"
maxlik_method: (for the mle method) the method for the maxLik::maxLik function: "NR" (default), "BFGS" or "NM"
optim_method: (for the mle method) the method for the stats::optim function: "BFGS" (default) or "Nelder-Mead"
epsilon: tolerance for the fitting algorithms (default 1e-04)
maxit: maximum number of iterations (default 500)
trace: logical; whether to trace the steps of the fitting algorithms (default FALSE)
penalty: the penalization function used during variable selection (default "SCAD")
a_SCAD: the tuning parameter of the SCAD penalty for the selection model (default 3.7)
a_MCP: the tuning parameter of the MCP penalty for the selection model (default 3)
lambda: a user-specified lambda value for penalized estimation (default -1, i.e. lambda is chosen by cross-validation)
lambda_min: the smallest value for lambda, as a fraction of lambda.max (default 0.001)
nlambda: the number of lambda values (default 50)
nfolds: the number of folds for cross-validation (default 10)
print_level: the level of printing done during the optimization of the propensity score model (default 0)
start_type: the type of starting values: "zero" (default), "mle" or "naive"
nleqslv_method: (for the gee method) the method passed to the nleqslv::nleqslv function: "Broyden" (default) or "Newton"
nleqslv_global: (for the gee method) the global strategy passed to the nleqslv::nleqslv function
nleqslv_xscalm: (for the gee method) the xscalm argument passed to the nleqslv::nleqslv function: "fixed" (default) or "auto"
dependence: logical value (default FALSE; NOT YET IMPLEMENTED, FOR FUTURE DEVELOPMENT)
key: a binary key variable allowing identification of the overlap (NOT YET IMPLEMENTED, FOR FUTURE DEVELOPMENT)
Details
The smooth function (gee_h_fun) for the generalized estimating equations (GEE) method takes the following values:
- if 1, then \boldsymbol{h}\left(\boldsymbol{x}, \boldsymbol{\theta}\right) = \frac{\boldsymbol{x}}{\pi(\boldsymbol{x}, \boldsymbol{\theta})},
- if 2, then \boldsymbol{h}\left(\boldsymbol{x}, \boldsymbol{\theta}\right) = \boldsymbol{x}.
Value
List with selected parameters.
See Also
nonprob() – for fitting procedure with non-probability samples.
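A sketch of switching to the GEE estimation method with gee_h_fun = 2; the resulting list is passed to nonprob() via the control_selection argument:

```r
library(nonprobsvy)
## calibrated (GEE) propensity score estimation with h(x, theta) = x
ctrl <- control_sel(est_method = "gee", gee_h_fun = 2)
str(ctrl[c("est_method", "gee_h_fun")])
```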
Extracts Estimates from the Nonprob Class Object
Description
Returns a data.frame of estimated mean(s) or standard error(s)
Usage
extract(object, what)
Arguments
object: an object of the nonprob class
what: what to extract: all estimates, i.e. mean(s), SE(s) and CI(s) ("all"; default), the mean(s) ("mean") or the standard error(s) ("se")
Value
a data.frame with selected information
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
strata = ~ size + nace + region, data = jvs)
ipw_est1 <- nonprob(selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit"
)
extract(ipw_est1)
extract(ipw_est1, "se")
Job Vacancy Survey
Description
This is a subset of the Job Vacancy Survey from Poland (for one quarter). The data has been subject to slight manipulation, but the relationships in the data have been preserved. For further details on the JVS, please refer to the following link: https://stat.gov.pl/obszary-tematyczne/rynek-pracy/popyt-na-prace/zeszyt-metodologiczny-popyt-na-prace,3,1.html.
Usage
jvs
Format
A single data.frame with 6,523 rows and 6 columns.
id: Identifier of an entity (company: legal or local).
private: Whether the company is a private (1) or public (0) entity.
size: The size of the entity: S – small (up to 9 employees), M – medium (10-49) or L – large (over 49).
nace: The main NACE code for a given entity: C, D.E, F, G, H, I, J, K.L, M, N, O, P, Q or R.S (14 levels, 3 combined: D and E, K and L, and R and S).
region: The region of Poland (16 levels: 02, 04, ..., 32).
weight: The final (calibrated) weight (w-weight). We do not have access to the design weights (d-weights).
Examples
data("jvs")
head(jvs)
Mass Imputation Using the Generalized Linear Model Method
Description
Model for the outcome for the mass imputation estimator using generalized linear
models via the stats::glm function. Estimation of the mean is done using the S_B
probability sample or known population totals.
Usage
method_glm(
y_nons,
X_nons,
X_rand,
svydesign,
weights = NULL,
family_outcome = "gaussian",
start_outcome = NULL,
vars_selection = FALSE,
pop_totals = NULL,
pop_size = NULL,
control_outcome = control_out(),
control_inference = control_inf(),
verbose = FALSE,
se = TRUE
)
Arguments
y_nons: the target variable from the non-probability sample
X_nons: a model.matrix with the auxiliary variables for the non-probability sample
X_rand: a model.matrix with the auxiliary variables for the probability sample
svydesign: a svydesign object
weights: case / frequency weights from the non-probability sample
family_outcome: the family for the glm model
start_outcome: starting parameters (default NULL)
vars_selection: whether variable selection should be conducted
pop_totals: population totals passed from the main nonprob function
pop_size: the population size passed from the main nonprob function
control_outcome: controls passed by the control_out function
control_inference: controls passed by the control_inf function
verbose: parameter passed from the main nonprob function
se: whether standard errors should be calculated
Details
Analytical variance
The variance of the mean is estimated based on the following approach
(a) non-probability part (S_A with size n_A; denoted as var_nonprob in the result)
\hat{V}_1 = \frac{1}{n_A^2}\sum_{i=1}^{n_A} \hat{e}_i \left\lbrace \boldsymbol{h}(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})^\prime\hat{\boldsymbol{c}}\right\rbrace,
where \hat{e}_i = y_i - m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}) and
\widehat{\boldsymbol{c}}=\left\lbrace n_B^{-1} \sum_{i \in B} \dot{\boldsymbol{m}}\left(\boldsymbol{x}_i ; \boldsymbol{\beta}^*\right) \boldsymbol{h}\left(\boldsymbol{x}_i ; \boldsymbol{\beta}^*\right)^{\prime}\right\rbrace^{-1} N^{-1} \sum_{i \in A} w_i \dot{\boldsymbol{m}}\left(\boldsymbol{x}_i ; \boldsymbol{\beta}^*\right).
Under the linear regression model \boldsymbol{h}\left(\boldsymbol{x}_i ; \widehat{\boldsymbol{\beta}}\right)=\boldsymbol{x}_i and \widehat{\boldsymbol{c}}=\left(n_A^{-1} \sum_{i \in A} \boldsymbol{x}_i \boldsymbol{x}_i^{\prime}\right)^{-1} N^{-1} \sum_{i \in B} w_i \boldsymbol{x}_i .
(b) probability part (S_B with size n_B; denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following
equation:
\hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
\frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_j; \hat{\boldsymbol{\beta}})}{\pi_j}.
Note that \hat{V}_2 in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
Furthermore, if only population totals/means are known and assumed to be fixed we set \hat{V}_2=0.
Information on the case when svydesign is not available:
- the variance is estimated only for the non-probability part, with \hat{V}_1 defined above;
- the point estimator of \hat{\mu}_y for linear regression is computed as \mu_x^\prime\hat{\boldsymbol{\beta}}, where \mu_x is the vector of population means;
- for non-linear functions such as logistic or Poisson regression we use a simplification, i.e. we report the point estimate as \exp(\mu_x^\prime\hat{\boldsymbol{\beta}}) for Poisson regression and \frac{\exp(\mu_x^\prime\hat{\boldsymbol{\beta}})}{1+\exp(\mu_x^\prime\hat{\boldsymbol{\beta}})} for logistic regression.
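To illustrate the population-totals case, a sketch using the main nonprob() function. The totals are derived here from the calibrated JVS weights purely for illustration (in practice they would come from an external source), and the assumption that pop_totals is a named vector matching the columns of the model matrix, with the "(Intercept)" entry equal to the population size, is the author's reading of the package conventions:

```r
library(nonprobsvy)
data(admin); data(jvs)
## population totals derived from the calibrated JVS weights;
## "(Intercept)" holds the estimated population size
X_jvs <- model.matrix(~ region + private + nace + size, jvs)
totals <- colSums(X_jvs * jvs$weight)
mi_est <- nonprob(outcome = single_shift ~ region + private + nace + size,
                  data = admin, pop_totals = totals,
                  method_outcome = "glm", family_outcome = "binomial")
```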
Value
a nonprob_method class object, which is a list with the following entries:
- model_fitted: the fitted model, either a glm.fit or a cv.ncvreg object
- y_nons_pred: predicted values for the non-probability sample
- y_rand_pred: predicted values for the probability sample or population totals
- coefficients: coefficients for the model (if available)
- svydesign: an updated surveydesign2 object (a new column y_hat_MI is added)
- y_mi_hat: the estimated population mean for the target variable
- vars_selection: whether variable selection was performed
- var_prob: the variance for the probability sample component (if available)
- var_nonprob: the variance for the non-probability sample component
- var_total: the total variance; if possible var_prob + var_nonprob, otherwise just a scalar
- model: the model type (character "glm")
- family: the family type (character "glm")
References
Kim, J. K., Park, S., Chen, Y., & Wu, C. (2021). Combining non-probability and probability survey samples through mass imputation. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(3), 941-963.
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs)
res_glm <- method_glm(y_nons = admin$single_shift,
X_nons = model.matrix(~ region + private + nace + size, admin),
X_rand = model.matrix(~ region + private + nace + size, jvs),
svydesign = jvs_svy)
res_glm
Mass Imputation Using Nearest Neighbours Matching Method
Description
Mass imputation using the nearest neighbours approach as described in Yang et al. (2021).
The implementation is currently based on the RANN::nn2 function and thus uses the
Euclidean distance for matching units from S_A (non-probability) to S_B (probability).
Estimation of the mean is done using the S_B sample.
Usage
method_nn(
y_nons,
X_nons,
X_rand,
svydesign,
weights = NULL,
family_outcome = NULL,
start_outcome = NULL,
vars_selection = FALSE,
pop_totals = NULL,
pop_size = NULL,
control_outcome = control_out(),
control_inference = control_inf(),
verbose = FALSE,
se = TRUE
)
Arguments
y_nons: the target variable from the non-probability sample
X_nons: a model.matrix with the auxiliary variables for the non-probability sample
X_rand: a model.matrix with the auxiliary variables for the probability sample
svydesign: a svydesign object
weights: case / frequency weights from the non-probability sample
family_outcome: a placeholder (not used in method_nn)
start_outcome: a placeholder (not used in method_nn)
vars_selection: whether variable selection should be conducted
pop_totals: a placeholder (not used in method_nn)
pop_size: the population size passed from the main nonprob function
control_outcome: controls passed by the control_out function
control_inference: controls passed by the control_inf function
verbose: parameter passed from the main nonprob function
se: whether standard errors should be calculated
Details
Analytical variance
The variance of the mean is estimated based on the following approach
(a) non-probability part (S_A with size n_A; denoted as var_nonprob in the result)
This may be estimated using
\hat{V}_1 = \frac{1}{N^2}\sum_{i=1}^{S_A}\frac{1-\hat{\pi}_B(\boldsymbol{x}_i)}{\hat{\pi}_B(\boldsymbol{x}_i)}\hat{\sigma}^2(\boldsymbol{x}_i),
where \hat{\pi}_B(\boldsymbol{x}_i) is an estimator of the propensity scores, which
we currently estimate using n_A/N (a constant), and \hat{\sigma}^2(\boldsymbol{x}_i) is
estimated based on the average of (y_i - y_i^*)^2.
Chlebicki et al. (2025, Algorithm 2) proposed a non-parametric mini-bootstrap estimator
(without assuming that it is consistent) with good finite population properties.
This bootstrap can be applied using control_inference(nn_exact_se=TRUE) and
can be summarized as follows:
1. Sample n_A units from S_A with replacement to create S_A' (if pseudo-weights are present, inclusion probabilities should be proportional to their inverses).
2. Match units from S_B to S_A' to obtain predictions y^* = {k}^{-1}\sum_{k} y_k.
3. Estimate \hat{\mu} = \frac{1}{N} \sum_{i \in S_B} d_i y_i^*.
4. Repeat steps 1-3 M times (we set M=50 in our simulations; this is hard-coded).
5. Estimate \hat{V}_1=\text{var}({\hat{\boldsymbol{\mu}}}) obtained from the simulations and save it as var_nonprob.
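The steps above can be sketched in base R for 1-NN matching on a single simulated covariate (an illustration only; nonprob() performs this internally when control_inf(nn_exact_se = TRUE) is set):

```r
set.seed(1)
N <- 1000; n_A <- 100; n_B <- 200
x_A <- runif(n_A); y_A <- 2 * x_A + rnorm(n_A, sd = 0.1)  # S_A (non-probability)
x_B <- runif(n_B); d_B <- rep(N / n_B, n_B)               # S_B design weights
M <- 50
mu_boot <- replicate(M, {
  idx <- sample(n_A, replace = TRUE)               # step 1: resample S_A
  match_id <- vapply(x_B, function(x)              # step 2: 1-NN match
    which.min(abs(x_A[idx] - x)), integer(1))
  y_star <- y_A[idx][match_id]
  sum(d_B * y_star) / N                            # step 3: weighted mean
})                                                 # step 4: M replicates
var_nonprob <- var(mu_boot)                        # step 5
```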
(b) probability part (S_B with size n_B; denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following
equation:
\hat{V}_2=\frac{1}{N^2} \sum_{i=1}^n \sum_{j=1}^n \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
\frac{y_i^*}{\pi_i} \frac{y_j^*}{\pi_j},
where y^*_i and y_j^* are values imputed as an average
of the k nearest neighbours, i.e. {k}^{-1}\sum_{k}y_k. Note that \hat{V}_2 in principle can be estimated in various ways depending on the type of the design and whether the population size is known or not.
Value
a nonprob_method class object, which is a list with the following entries:
- model_fitted: the RANN::nn2 object
- y_nons_pred: predicted values for the non-probability sample (query to itself)
- y_rand_pred: predicted values for the probability sample
- coefficients: coefficients for the model (if available)
- svydesign: an updated surveydesign2 object (a new column y_hat_MI is added)
- y_mi_hat: the estimated population mean for the target variable
- vars_selection: whether variable selection was performed (not implemented, for further development)
- var_prob: the variance for the probability sample component (if available)
- var_nonprob: the variance for the non-probability sample component
- var_tot: the total variance; if possible var_prob + var_nonprob, otherwise just a scalar
- model: the model type (character "nn")
- family: a placeholder for the NN approach information
References
Yang, S., Kim, J. K., & Hwang, Y. (2021). Integration of data from probability surveys and big found data for finite population inference using mass imputation. Survey Methodology, 47(1), 29-58.
Chlebicki, P., Chrostowski, Ł., & Beręsewicz, M. (2025). Data integration of non-probability and probability samples with predictive mean matching. arXiv preprint arXiv:2403.13750.
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs)
res_nn <- method_nn(y_nons = admin$single_shift,
X_nons = model.matrix(~ region + private + nace + size, admin),
X_rand = model.matrix(~ region + private + nace + size, jvs),
svydesign = jvs_svy)
res_nn
Mass Imputation Using Non-Parametric Model Method
Description
Model for the outcome for the mass imputation estimator using loess via stats::loess.
Estimation of the mean is done using the S_B probability sample.
Usage
method_npar(
y_nons,
X_nons,
X_rand,
svydesign,
weights = NULL,
family_outcome = "gaussian",
start_outcome = NULL,
vars_selection = FALSE,
pop_totals = NULL,
pop_size = NULL,
control_outcome = control_out(),
control_inference = control_inf(),
verbose = FALSE,
se = TRUE
)
Arguments
y_nons: the target variable from the non-probability sample
X_nons: a model.matrix with the auxiliary variables for the non-probability sample
X_rand: a model.matrix with the auxiliary variables for the probability sample
svydesign: a svydesign object
weights: case / frequency weights from the non-probability sample (default NULL)
family_outcome: the family for the model
start_outcome: a placeholder (not used in method_npar)
vars_selection: whether variable selection should be conducted
pop_totals: a placeholder (not used in method_npar)
pop_size: the population size passed from the main nonprob function
control_outcome: controls passed by the control_out function
control_inference: controls passed by the control_inf function
verbose: parameter passed from the main nonprob function
se: whether standard errors should be calculated
Details
Analytical variance
The variance of the mean is estimated based on the following approach
(a) non-probability part (S_A with size n_A; denoted as var_nonprob in the result)
\hat{V}_1 = \frac{1}{N^2} \sum_{i=1}^{n_A} \left\lbrace\hat{g}_B(\boldsymbol{x}_i)\right\rbrace^{2} \hat{e}_i^2,
where \hat{e}_i=y_i - \hat{m}(x_i) is the residual and \hat{g}_B(\boldsymbol{x}_i) = \left\lbrace \pi_B(\boldsymbol{x}_i) \right\rbrace^{-1} can be estimated
in various ways. In the package we estimate \hat{g}_B(\boldsymbol{x}_i) using \pi_B(\boldsymbol{x}_i)=E(R | \boldsymbol{x}), as suggested by Chen et al. (2022, p. 6). In particular,
we currently support this using stats::loess with the "gaussian" family.
(b) probability part (S_B with size n_B; denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following
equation:
\hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
\frac{\hat{m}(x_i)}{\pi_i} \frac{\hat{m}(x_j)}{\pi_j}.
Note that \hat{V}_2 in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
Value
a nonprob_method class object, which is a list with the following entries:
- model_fitted: the fitted model object returned by stats::loess
- y_nons_pred: predicted values for the non-probability sample
- y_rand_pred: predicted values for the probability sample or population totals
- coefficients: coefficients for the model (if available)
- svydesign: an updated surveydesign2 object (a new column y_hat_MI is added)
- y_mi_hat: the estimated population mean for the target variable
- vars_selection: whether variable selection was performed
- var_prob: the variance for the probability sample component (if available)
- var_nonprob: the variance for the non-probability sample component
- model: the model type (character "npar")
References
Chen, S., Yang, S., & Kim, J. K. (2022). Nonparametric mass imputation for data integration. Journal of Survey Statistics and Methodology, 10(1), 1-24.
Examples
set.seed(123123123)
N <- 10000
n_a <- 500
n_b <- 1000
n_b1 <- 0.7*n_b
n_b2 <- 0.3*n_b
x1 <- rnorm(N, 2, 1)
x2 <- rnorm(N, 2, 1)
y1 <- rnorm(N, 0.3 + 2*x1+ 2*x2, 1)
y2 <- rnorm(N, 0.3 + 0.5*x1^2+ 0.5*x2^2, 1)
strata <- x1 <= 2
pop <- data.frame(x1, x2, y1, y2, strata)
sample_a <- pop[sample(1:N, n_a),]
sample_a$w_a <- N/n_a
sample_a_svy <- svydesign(ids=~1, weights=~w_a, data=sample_a)
pop1 <- subset(pop, strata == TRUE)
pop2 <- subset(pop, strata == FALSE)
sample_b <- rbind(pop1[sample(1:nrow(pop1), n_b1), ],
pop2[sample(1:nrow(pop2), n_b2), ])
res_y_npar <- nonprob(outcome = y1 + y2 ~ x1 + x2,
data = sample_b,
svydesign = sample_a_svy,
method_outcome = "npar")
res_y_npar
Mass Imputation Using Predictive Mean Matching Method
Description
Model for the outcome for the mass imputation estimator. The implementation is currently based on the RANN::nn2 function and thus uses the Euclidean distance for matching units from S_A (non-probability) to S_B (probability) based on predicted values from a model estimated
either by method_glm or method_npar. Estimation of the mean is done using the S_B sample.
This implementation extends the Yang et al. (2021) approach as described in Chlebicki et al. (2025), namely:
- pmm_weights: if k > 1, a weighted aggregation of the mean for a given unit is used; we use the distance matrix returned by the RANN::nn2 function (pmm_weights in the control_out() function)
- nn_exact_se: if the non-probability sample is small, we recommend using a mini-bootstrap approach to estimate the variance from the non-probability sample (nn_exact_se in the control_inf() function)
- pmm_k_choice: the main nonprob function allows for dynamic selection of k neighbours based on a variance minimization procedure (pmm_k_choice in the control_out() function)
Usage
method_pmm(
y_nons,
X_nons,
X_rand,
svydesign,
weights = NULL,
family_outcome = "gaussian",
start_outcome = NULL,
vars_selection = FALSE,
pop_totals = NULL,
pop_size = NULL,
control_outcome = control_out(),
control_inference = control_inf(),
verbose = FALSE,
se = TRUE
)
Arguments
y_nons: the target variable from the non-probability sample
X_nons: a model.matrix with the auxiliary variables for the non-probability sample
X_rand: a model.matrix with the auxiliary variables for the probability sample
svydesign: a svydesign object
weights: case / frequency weights from the non-probability sample
family_outcome: the family for the glm model
start_outcome: starting parameters
vars_selection: whether variable selection should be conducted
pop_totals: a placeholder (not used in method_pmm)
pop_size: the population size passed from the main nonprob function
control_outcome: controls passed by the control_out function
control_inference: controls passed by the control_inf function
verbose: parameter passed from the main nonprob function
se: whether standard errors should be calculated
Details
Matching
In the package we support two types of matching:
- \hat{y} - \hat{y} matching (default; control_out(pmm_match_type = 1))
- \hat{y} - y matching (control_out(pmm_match_type = 2))
Analytical variance
The variance of the mean is estimated based on the following approach
(a) the non-probability part (S_A with size n_A; denoted as var_nonprob in the result) is currently estimated using the non-parametric mini-bootstrap estimator proposed by
Chlebicki et al. (2025, Algorithm 2). It is not proven to be consistent but has good finite population properties.
This bootstrap can be applied using control_inference(nn_exact_se=TRUE) and
can be summarized as follows:
1. Sample n_A units from S_A with replacement to create S_A' (if pseudo-weights are present, inclusion probabilities should be proportional to their inverses).
2. Estimate the regression model \mathbb{E}[Y|\boldsymbol{X}]=m(\boldsymbol{X}, \cdot) based on S_{A}' from step 1.
3. Compute \hat{\nu}'(i,t) for t=1,\dots,k, i\in S_{B} using the estimated m(\boldsymbol{x}', \cdot) and \left\lbrace(y_{j},\boldsymbol{x}_{j})| j\in S_{A}'\right\rbrace.
4. Compute \frac{1}{k}\sum_{t=1}^{k}y_{\hat{\nu}'(i)} using Y values from S_{A}'.
5. Repeat steps 1-4 M times (we set M=50 in our code; this is hard-coded).
6. Estimate \hat{V}_1=\text{var}({\hat{\boldsymbol{\mu}}}) obtained from the simulations and save it as var_nonprob.
(b) probability part (S_B with size n_B; denoted as var_prob in the result)
This part uses functionalities of the {survey} package and the variance is estimated using the following
equation:
\hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_B} \sum_{j=1}^{n_B} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}}
\frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_j; \hat{\boldsymbol{\beta}})}{\pi_j}.
Note that \hat{V}_2 in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
Value
a nonprob_method class object, which is a list with the following entries:
- model_fitted: the fitted model, either a glm.fit or a cv.ncvreg object
- y_nons_pred: predicted values for the non-probability sample
- y_rand_pred: predicted values for the probability sample or population totals
- coefficients: coefficients for the model (if available)
- svydesign: an updated surveydesign2 object (a new column y_hat_MI is added)
- y_mi_hat: the estimated population mean for the target variable
- vars_selection: whether variable selection was performed
- var_prob: the variance for the probability sample component (if available)
- var_nonprob: the variance for the non-probability sample component
- model: the model type (character "pmm")
- family: depends on the method selected for estimating E(Y|X)
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight, strata = ~ size + nace + region, data = jvs)
res_pmm <- method_pmm(y_nons = admin$single_shift,
X_nons = model.matrix(~ region + private + nace + size, admin),
X_rand = model.matrix(~ region + private + nace + size, jvs),
svydesign = jvs_svy)
res_pmm
Propensity Score Model Functions
Description
Function to specify the propensity score (PS) model for the inverse probability weighting estimator.
This function provides basic functions for logistic regression with a given link function (currently
we support logit, probit and cloglog), with additional information about the analytic variance estimator of the mean.
It returns a list of functions that refer to specific estimation methods and variance estimators,
depending on whether the IPW alone or the DR estimator is applied. This function is exported mainly because
its components are used in the variable selection algorithms.
Functions starting with make_log_like, make_gradient and make_hessian refer to the maximum likelihood estimation
as described in the Chen et al. (2020) paper. These functions take into account different link functions defined through
the link argument.
Functions make_link, make_link_inv, make_link_der, make_link_inv_der, make_link_inv_rev, and make_link_inv_rev_der
refer to specific link functions and are used in the estimation process.
Functions variance_covariance1 and variance_covariance2 refer to the variance estimator of the IPW estimator as
defined by Chen et al. (2020).
Functions b_vec_ipw, b_vec_dr and t_vec are specific functions defined in the Chen et al. (2020) that are used
in the variance estimator of the IPW or the DR.
Finally, var_nonprob is the non-probability component of the DR estimator as defined by Chen et al. (2020).
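The link-specific pieces can be illustrated in base R (a sketch only; the actual functions returned by method_ps() additionally carry derivatives and the variance components listed below):

```r
# Minimal base-R versions of the three supported links and their inverses.
links <- list(
  logit   = list(link = qlogis, inv = plogis),
  probit  = list(link = qnorm,  inv = pnorm),
  cloglog = list(link = function(p) log(-log(1 - p)),
                 inv  = function(eta) 1 - exp(-exp(eta)))
)
# The inverse link maps a linear predictor to a propensity score in (0, 1):
links$cloglog$inv(0)  # 1 - exp(-1), about 0.632
# and link(inv(eta)) recovers eta:
links$probit$link(links$probit$inv(1.5))  # 1.5
```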
Usage
method_ps(link = c("logit", "probit", "cloglog"), ...)
Arguments
link |
link for the PS model |
... |
Additional, optional arguments. |
Value
A list of functions and elements for a specific link function with the following entries:
- make_log_like
log-likelihood function for a specific link function
- make_gradient
gradient of the loglik
- make_hessian
hessian of the loglik
- make_link
link function
- make_link_inv
inverse link function
- make_link_der
first derivative of the link function
- make_link_inv_der
first derivative of the inverse link function
- make_link_inv_rev
defines 1/inv_link
- make_link_inv_rev_der
first derivative of 1/inv_link
- variance_covariance1
for the IPW estimator: variance component for the non-probability sample
- variance_covariance2
for the IPW estimator: variance component for the probability sample
- b_vec_ipw
for the IPW estimator: the b function as defined in Chen et al. (2020, sec. 3.2, eq. (9)-(10); sec. 4.1)
- b_vec_dr
for the DR estimator: the b function as defined in Chen et al. (2020, sec. 3.3, eq. (14); sec. 4.1)
- t_vec
for the DR estimator: the t function as defined in Chen et al. (2020, sec. 3.3, eq. (14); sec. 4.1)
- var_nonprob
for the DR estimator: the non-probability component of the variance
- link
name of the selected link function for the PS model (character)
- model
model type (character)
Author(s)
Łukasz Chrostowski, Maciej Beręsewicz
Examples
# Printing information on the model selected
method_ps()
# extracting specific field
method_ps("cloglog")$make_gradient
Returns the Number of Rows in Samples
Description
Returns information on the number of rows of the probability sample (if provided) and non-probability sample.
Usage
## S3 method for class 'nonprob'
nobs(object, ...)
Arguments
object |
a nonprob class object |
... |
other arguments passed to methods (currently not supported) |
Value
a named vector with row numbers
Inference with Non-Probability Survey Samples
Description
The nonprob function provides access to various methods for inference based on non-probability surveys (including big data). The function allows estimation of the population mean based on access to a reference probability sample (via the survey package), or to totals or means of covariates.
The package implements state-of-the-art approaches recently proposed in the literature: Chen et al. (2020), Yang et al. (2020) and Wu (2022), and uses the Lumley (2004) survey package for inference when a reference probability sample is provided.
It provides various inverse probability weighting estimators (e.g. with calibration constraints), mass imputation estimators (e.g. nearest neighbour, predictive mean matching) and doubly robust estimators (e.g. estimators that minimise the asymptotic bias of the population mean estimator).
All optional parameters are set to NULL. The obligatory ones include data as well as one of the following three: selection, outcome or target, depending on which method has been selected. In the case of outcome and target, multiple y variables can be specified.
Usage
nonprob(
data,
selection = NULL,
outcome = NULL,
target = NULL,
svydesign = NULL,
pop_totals = NULL,
pop_means = NULL,
pop_size = NULL,
method_selection = c("logit", "cloglog", "probit"),
method_outcome = c("glm", "nn", "pmm", "npar"),
family_outcome = c("gaussian", "binomial", "poisson"),
subset = NULL,
strata = NULL,
case_weights = NULL,
na_action = na.omit,
control_selection = control_sel(),
control_outcome = control_out(),
control_inference = control_inf(),
start_selection = NULL,
start_outcome = NULL,
verbose = FALSE,
se = TRUE,
...
)
Arguments
data |
a data.frame with data for the non-probability sample |
selection |
a formula (one-sided) describing the selection (propensity score) model |
outcome |
a formula describing the outcome (mass imputation) model |
target |
a formula (one-sided) with the target variable(s) |
svydesign |
an optional svydesign2 object (from the survey package) containing the reference probability sample |
pop_totals |
an optional named vector with population totals of the covariates |
pop_means |
an optional named vector with population means of the covariates |
pop_size |
an optional double with the population size |
method_selection |
a character indicating the method for propensity score estimation |
method_outcome |
a character indicating the method for the outcome (mass imputation) model |
family_outcome |
a character describing the error distribution and link function to be used in the outcome model |
subset |
an optional vector specifying a subset of observations to be used in the fitting process |
strata |
an optional vector specifying strata (not yet supported; reserved for future development) |
case_weights |
an optional vector of prior weights to be used in the fitting process (these are not sampling weights) |
na_action |
a function which indicates what should happen when the data contain NAs (default na.omit) |
control_selection |
a list (e.g. from control_sel()) with parameters for the selection model |
control_outcome |
a list (e.g. from control_out()) with parameters for the outcome model |
control_inference |
a list (e.g. from control_inf()) with parameters for the inference |
start_selection |
an optional vector with starting values for the parameters of the selection equation |
start_outcome |
an optional vector with starting values for the parameters of the outcome equation |
verbose |
a logical value (default FALSE) indicating whether detailed information on the fitting process should be printed |
se |
a logical value (default TRUE) indicating whether to calculate and return the standard error of the estimated mean |
... |
Additional, optional arguments (not yet supported) |
Details
Let y be the response variable for which we want to estimate the population mean,
given by
\mu_{y} = \frac{1}{N} \sum_{i=1}^N y_{i}.
For this purpose we consider data integration
with the following structure. Let S_A be the non-probability sample with the design matrix of covariates as
\boldsymbol{X}_A =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \cr
x_{21} & x_{22} & \cdots & x_{2p} \cr
\vdots & \vdots & \ddots & \vdots \cr
x_{n_A 1} & x_{n_A 2} & \cdots & x_{n_A p} \cr
\end{bmatrix},
and vector of outcome variable
\boldsymbol{y} =
\begin{bmatrix}
y_{1} \cr
y_{2} \cr
\vdots \cr
y_{n_{A}}
\end{bmatrix}.
On the other hand, let S_B be the probability sample with design matrix of covariates be
\boldsymbol{X}_B =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \cr
x_{21} & x_{22} & \cdots & x_{2p} \cr
\vdots & \vdots & \ddots & \vdots \cr
x_{n_B 1} & x_{n_B 2} & \cdots & x_{n_B p} \cr
\end{bmatrix}.
Instead of a sample of units we can consider a vector of population sums in the form of \tau_x = (\sum_{i \in \mathcal{U}}\boldsymbol{x}_{i1}, \sum_{i \in \mathcal{U}}\boldsymbol{x}_{i2}, ..., \sum_{i \in \mathcal{U}}\boldsymbol{x}_{ip}) or means
\frac{\tau_x}{N}, where \mathcal{U} refers to a finite population. Note that we do not assume access to the response variable for S_B.
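For example (a hedged sketch with simulated, hypothetical data), an IPW estimator can be fitted with only the population totals of the covariates, where the named vector includes the intercept entry carrying the population size N:

```r
# A sketch: IPW estimation using only population totals of the covariates,
# so no svydesign object is needed. Data and model are made up.
library(nonprobsvy)
set.seed(1)
N  <- 10000
x1 <- rnorm(N)
x2 <- rbinom(N, 1, 0.5)
y  <- 1 + x1 + x2 + rnorm(N)
rho <- plogis(-2 + 0.5 * x1)                    # unknown selection mechanism
nonprob_df <- data.frame(y, x1, x2)[runif(N) < rho, ]
ipw_est <- nonprob(
  selection  = ~ x1 + x2,
  target     = ~ y,
  data       = nonprob_df,
  # the intercept entry carries the population size N
  pop_totals = c(`(Intercept)` = N, x1 = sum(x1), x2 = sum(x2))
)
```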
In general we make the following assumptions:
- The selection indicator of belonging to the non-probability sample R_{i} and the response variable y_i are independent given the set of covariates \boldsymbol{x}_i.
- All units have a non-zero propensity score, i.e., \pi_{i}^{A} > 0 for all i.
- The indicator variables R_{i}^{A} and R_{j}^{A} are independent given \boldsymbol{x}_i and \boldsymbol{x}_j for i \neq j.
There are three possible approaches to the problem of estimating population mean using non-probability samples:
- Inverse probability weighting – the main drawback of non-probability sampling is the unknown selection mechanism for a unit to be included in the sample. This is why we talk about the so-called "biased sample" problem. The inverse probability approach is based on the assumption that a reference probability sample is available and therefore we can estimate the propensity score of the selection mechanism. The estimator has the following form:
\hat{\mu}_{IPW} = \frac{1}{N^{A}}\sum_{i \in S_{A}} \frac{y_{i}}{\hat{\pi}_{i}^{A}}.
For this purpose several estimation methods can be considered. The first approach is maximum likelihood estimation with a corrected log-likelihood function, which is given by the following formula
\ell^{*}(\boldsymbol{\theta}) = \sum_{i \in S_{A}}\log \left\lbrace \frac{\pi(\boldsymbol{x}_{i}, \boldsymbol{\theta})}{1 - \pi(\boldsymbol{x}_{i},\boldsymbol{\theta})}\right\rbrace + \sum_{i \in S_{B}}d_{i}^{B}\log \left\lbrace 1 - \pi(\boldsymbol{x}_{i},\boldsymbol{\theta})\right\rbrace.
In the literature, the main approach to modelling propensity scores is based on the logit link function. However, we extend the propensity score model with additional link functions such as cloglog and probit. The pseudo-score equations derived from ML methods can be replaced by the idea of generalised estimating equations with calibration constraints defined by the equations
\mathbf{U}(\boldsymbol{\theta})=\sum_{i \in S_A} \mathbf{h}\left(\mathbf{x}_i, \boldsymbol{\theta}\right)-\sum_{i \in S_B} d_i^B \pi\left(\mathbf{x}_i, \boldsymbol{\theta}\right) \mathbf{h}\left(\mathbf{x}_i, \boldsymbol{\theta}\right).
Notice that for \mathbf{h}\left(\mathbf{x}_i, \boldsymbol{\theta}\right) = \frac{\mathbf{x}_i}{\pi(\mathbf{x}_i, \boldsymbol{\theta})} we do not need a probability sample and can instead use a vector of population totals/means.
- Mass imputation – this method is based on a framework where imputed values of the outcome variable are created for the entire probability sample. In this case, we treat the large sample as a training data set used to build an imputation model. Using the imputed values for the probability sample and the (known) design weights, we can build a population mean estimator of the form:
\hat{\mu}_{MI} = \frac{1}{N^B}\sum_{i \in S_{B}} d_{i}^{B} \hat{y}_i.
This opens the door to very flexible imputation models. The package uses generalised linear models from stats::glm(), the nearest neighbour algorithm using RANN::nn2() and predictive mean matching.
- Doubly robust estimation – the IPW and MI estimators are sensitive to misspecification of the propensity score and outcome models, respectively. To address this, so-called doubly robust methods combine the propensity score and imputation models during inference, leading to the following estimator
\hat{\mu}_{DR} = \frac{1}{N^A}\sum_{i \in S_A} \hat{d}_i^A (y_i - \hat{y}_i) + \frac{1}{N^B}\sum_{i \in S_B} d_i^B \hat{y}_i.
In addition, an approach based directly on bias minimisation has been implemented. The bias
\begin{aligned} bias(\hat{\mu}_{DR}) = & \mathbb{E} (\hat{\mu}_{DR} - \mu) \cr = & \mathbb{E} \left\lbrace \frac{1}{N} \sum_{i=1}^N \left(\frac{R_i^A}{\pi_i^A (\boldsymbol{x}_i^{\mathrm{T}} \boldsymbol{\theta})} - 1 \right) (y_i - \operatorname{m}(\boldsymbol{x}_i^{\mathrm{T}} \boldsymbol{\beta})) \right\rbrace \cr + & \mathbb{E} \left\lbrace \frac{1}{N} \sum_{i=1}^N (R_i^B d_i^B - 1) \operatorname{m}( \boldsymbol{x}_i^{\mathrm{T}} \boldsymbol{\beta}) \right\rbrace \end{aligned}
leads to the following system of equations
\begin{aligned} J(\theta, \beta) = \left\lbrace \begin{array}{c} J_1(\theta, \beta) \cr J_2(\theta, \beta) \end{array}\right\rbrace = \left\lbrace \begin{array}{c} \sum_{i=1}^N R_i^A \left\lbrace \frac{1}{\pi(\boldsymbol{x}_i, \boldsymbol{\theta})}-1 \right\rbrace \left\lbrace y_i-m(\boldsymbol{x}_i, \boldsymbol{\beta}) \right\rbrace \boldsymbol{x}_i \cr \sum_{i=1}^N \frac{R_i^A}{\pi(\boldsymbol{x}_i, \boldsymbol{\theta})} \frac{\partial m(\boldsymbol{x}_i, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} - \sum_{i \in \mathcal{S}_{\mathrm{B}}} d_i^{\mathrm{B}} \frac{\partial m(\boldsymbol{x}_i, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} \end{array} \right\rbrace, \end{aligned}
where m(\boldsymbol{x}_{i}, \boldsymbol{\beta}) is a mass imputation (regression) model for the outcome variable and the propensity scores \pi_i^A are estimated using a logit function for the model. As with the MLE and GEE approaches, we have extended this method to the cloglog and probit links.
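The algebra of the three estimators can be illustrated with a toy base-R computation, assuming (unrealistically) that the true propensity scores and a correctly specified outcome model are available; this is purely illustrative and not how the package estimates them:

```r
# Toy illustration of the estimator algebra (NOT how the package fits models):
# we pretend the true propensity scores and a correct outcome model are known.
set.seed(42)
N    <- 1e5
x    <- rnorm(N)
y    <- 2 + x + rnorm(N)
pi_A <- plogis(-2 + 0.6 * x)   # selection propensities (normally unknown)
s_A  <- runif(N) < pi_A        # non-probability sample indicator
m_x  <- 2 + x                  # outcome model m(x; beta), here the truth
d_B  <- rep(1, N)              # a census playing the role of S_B (d_i^B = 1)
# IPW: inverse-propensity weighted mean over the non-probability sample
mu_ipw <- sum(y[s_A] / pi_A[s_A]) / sum(1 / pi_A[s_A])
# MI: design-weighted mean of imputed values over the "probability sample"
mu_mi  <- sum(d_B * m_x) / sum(d_B)
# DR: MI corrected by inverse-propensity weighted residuals
mu_dr  <- mu_mi + sum((y[s_A] - m_x[s_A]) / pi_A[s_A]) / sum(1 / pi_A[s_A])
c(true = mean(y), ipw = mu_ipw, mi = mu_mi, dr = mu_dr)  # all close to 2
```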
As it is not straightforward to calculate the variances of these estimators, asymptotic equivalents of the variances derived using the Taylor approximation have been proposed in the literature (see the references). In addition, the bootstrap approach can be used for variance estimation.
The function also allows variable selection using known methods that have been implemented to handle the integration of probability and non-probability sampling.
In the presence of high-dimensional data, variable selection is important because it can reduce the variability of the estimate caused by using irrelevant variables to build the model.
Let \operatorname{U}\left( \boldsymbol{\theta}, \boldsymbol{\beta} \right) be the joint estimating function for \left( \boldsymbol{\theta}, \boldsymbol{\beta} \right). We define the
penalized estimating functions as
\operatorname{U}^p \left(\boldsymbol{\theta}, \boldsymbol{\beta}\right) =
\operatorname{U}\left(\boldsymbol{\theta}, \boldsymbol{\beta}\right) -
\left\lbrace
\begin{array}{c}
q_{\lambda_\theta}(|\boldsymbol{\theta}|) \operatorname{sgn}(\boldsymbol{\theta}) \\
q_{\lambda_\beta}(|\boldsymbol{\beta}|) \operatorname{sgn}(\boldsymbol{\beta})
\end{array}
\right\rbrace,
where q_{\lambda_{\theta}} and q_{\lambda_{\beta}} are some smooth functions. We let q_{\lambda} \left(x\right) = \frac{\partial p_{\lambda}}{\partial x}, where p_{\lambda} is some penalization function.
Details of penalization functions and techniques for solving this type of equation can be found in the references.
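For instance, one common choice of the penalty derivative q_lambda is that of the SCAD penalty of Fan and Li (2001) with the conventional a = 3.7; a minimal sketch (internally the package relies on ncvreg, which implements SCAD, MCP and lasso penalties):

```r
# Derivative q_lambda of the SCAD penalty (Fan and Li, 2001), a = 3.7:
# small coefficients are penalised at rate lambda, large ones not at all.
q_scad <- function(t, lambda, a = 3.7) {
  ifelse(t <= lambda, lambda, pmax(a * lambda - t, 0) / (a - 1))
}
q_scad(c(0.1, 1, 5), lambda = 0.5)  # 0.5, (1.85 - 1)/2.7, 0
```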
To use variable selection, set the vars_selection parameter in the control_inf() function to TRUE. In addition, in the other control functions, such as control_sel() and control_out(),
you can set parameters for the selection of the relevant variables, such as the number of folds for the cross-validation algorithm or the lambda value for the penalization. Details can be found
in the documentation of the control functions for nonprob.
Value
Returns an object of the nonprob class (which is a list) containing the following elements:
- call – the call of the nonprob function
- data – a data.frame passed from the nonprob function's data argument
- X – a model.matrix containing data from the probability sample (first n_{S_B} rows) and the non-probability sample (next n_{S_A} rows), if specified in the function call
- y – a list of vectors of outcome variables, if specified in the function call
- R – a numeric vector indicating whether a unit belongs to the probability (0) or non-probability (1) sample in the matrix X
- ps_scores – a numeric vector of estimated propensity scores for the probability and non-probability samples
- case_weights – a vector of case weights for the non-probability sample, based on the call
- ipw_weights – a vector of inverse probability weights for the non-probability sample (if applicable)
- control – a list of control functions based on the call
- output – a data.frame with the estimated means and standard errors for the variables specified in the target or outcome arguments
- SE – a data.frame with the standard error of the estimator of the population mean, divided into errors from the probability and non-probability samples (if applicable)
- confidence_interval – a data.frame with the confidence interval of the population mean estimator
- nonprob_size – a scalar numeric vector denoting the size of the non-probability sample
- prob_size – a scalar numeric vector denoting the size of the probability sample
- pop_size – a scalar numeric vector with the population size derived from the estimated weights (non-probability sample) or the known design weights (probability sample)
- pop_size_fixed – a logical value indicating whether the population size was fixed (known) or estimated (unknown)
- pop_totals – a numeric vector with the total values of the auxiliary variables derived from a probability sample or based on the call
- pop_means – a numeric vector with the mean values of the auxiliary variables derived from a probability sample or based on the call
- outcome – a list containing information about the fitting of the mass imputation model; its structure is based on the method_outcome and family_outcome arguments, which point to the specific methods defined by the method_* functions (if specified in the call)
- selection – a list containing information about the fitting of the propensity score model; its structure is based on the method_selection argument, which points to the specific methods defined by the method_ps function (if specified in the call)
- boot_sample – a matrix with bootstrap estimates of the target variable(s) (if specified)
- svydesign – a svydesign2 object (if specified)
- ys_rand_pred – a list of predicted values of the target variable(s) for the probability sample (for the MI and DR estimators)
- ys_nons_pred – a list of predicted values of the target variable(s) for the non-probability sample (for the MI and DR estimators)
- ys_resid – a list of residuals of the target variable(s) for the non-probability sample (for the MI and DR estimators)
- estimator – a character vector indicating the type of estimator selected (one of c("ipw", "mi", "dr"))
- selection_formula – a formula based on the selection argument (if specified)
- estimator_method – a character vector with information on the detailed method applied (for the print method)
Author(s)
Łukasz Chrostowski, Maciej Beręsewicz, Piotr Chlebicki
References
Chen, Y., Li, P. and Wu, C. (2020). Doubly Robust Inference With Nonprobability Survey Samples. Journal of the American Statistical Association, 115(532), 2011-2021.
Kim, J. K., Park, S., Chen, Y. and Wu, C. (2021). Combining non-probability and probability survey samples through mass imputation. Journal of the Royal Statistical Society: Series A, 184, 941-963.
Yang, S., Kim, J. K. and Song, R. (2020). Doubly robust inference when combining probability and non-probability samples with high dimensional data. Journal of the Royal Statistical Society: Series B.
Yang, S., Kim, J. K. and Hwang, Y. (2021). Integration of data from probability surveys and big data for finite population inference using mass imputation. Survey Methodology, 47(1), 29-58.
See Also
stats::optim() – For more information on the optim function used in the
optim method of propensity score model fitting.
maxLik::maxLik() – For more information on the maxLik function used in
maxLik method of propensity score model fitting.
ncvreg::cv.ncvreg() – For more information on the cv.ncvreg function used in
variable selection for the outcome model.
nleqslv::nleqslv() – For more information on the nleqslv function used in
estimation process of the bias minimization approach.
stats::glm() – For more information about the generalised linear models used during mass imputation process.
RANN::nn2() – For more information about the nearest neighbour algorithm used during mass imputation process.
control_sel() – For the control parameters related to selection model.
control_out() – For the control parameters related to outcome model.
control_inf() – For the control parameters related to statistical inference.
Examples
# generate data based on Doubly Robust Inference With Non-probability Survey Samples (2021)
# Yilin Chen , Pengfei Li & Changbao Wu
set.seed(123)
# sizes of population and probability sample
N <- 20000 # population
n_b <- 1000 # probability
# data
z1 <- rbinom(N, 1, 0.7)
z2 <- runif(N, 0, 2)
z3 <- rexp(N, 1)
z4 <- rchisq(N, 4)
# covariates
x1 <- z1
x2 <- z2 + 0.3 * z1
x3 <- z3 + 0.2 * (z1 + z2)
x4 <- z4 + 0.1 * (z1 + z2 + z3)
epsilon <- rnorm(N)
sigma_30 <- 10.4
sigma_50 <- 5.2
sigma_80 <- 2.4
# response variables
y30 <- 2 + x1 + x2 + x3 + x4 + sigma_30 * epsilon
y50 <- 2 + x1 + x2 + x3 + x4 + sigma_50 * epsilon
y80 <- 2 + x1 + x2 + x3 + x4 + sigma_80 * epsilon
# population
sim_data <- data.frame(y30, y50, y80, x1, x2, x3, x4)
## propensity score model for non-probability sample (sum to 1000)
eta <- -4.461 + 0.1 * x1 + 0.2 * x2 + 0.1 * x3 + 0.2 * x4
rho <- plogis(eta)
# inclusion probabilities for probability sample
z_prob <- x3 + 0.2051
sim_data$p_prob <- n_b * z_prob / sum(z_prob)
# data
sim_data$flag_nonprob <- as.numeric(runif(N) < rho) ## sampling nonprob
sim_data$flag_prob <- as.numeric(runif(N) < sim_data$p_prob) ## sampling prob
nonprob_df <- subset(sim_data, flag_nonprob == 1) ## non-probability sample
svyprob <- svydesign(
ids = ~1, probs = ~p_prob,
data = subset(sim_data, flag_prob == 1),
pps = "brewer"
) ## probability sample
## mass imputation estimator
mi_res <- nonprob(
outcome = y30 + y50 + y80 ~ x1 + x2 + x3 + x4,
data = nonprob_df,
svydesign = svyprob
)
mi_res
## inverse probability weighted estimator
ipw_res <- nonprob(
selection = ~ x1 + x2 + x3 + x4,
target = ~y30 + y50 + y80,
data = nonprob_df,
svydesign = svyprob
)
ipw_res
## doubly robust estimator
dr_res <- nonprob(
outcome = y30 + y50 + y80 ~ x1 + x2 + x3 + x4,
selection = ~ x1 + x2 + x3 + x4,
data = nonprob_df,
svydesign = svyprob
)
dr_res
Plots the Estimated Mean(s) and Their Confidence Interval(s)
Description
Simple plotting method that compares the estimated mean(s) and CI(s) with the naive (uncorrected) estimates.
Usage
## S3 method for class 'nonprob'
plot(x, ...)
Arguments
x |
the |
... |
other arguments passed to the plot method (currently not supported) |
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
strata = ~ size + nace + region, data = jvs)
ipw_est1 <- nonprob(selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit")
plot(ipw_est1)
Returns Population Size (Estimated or Fixed)
Description
Returns the population size, which is assumed to be:
- fixed – if it is based on the pop_size argument,
- estimated – if it is based on the probability survey specified in svydesign or on the estimated propensity scores for the non-probability sample.
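In the IPW case, the estimated population size is simply the sum of the inverse probability weights; a minimal illustration with made-up propensity scores:

```r
# made-up propensity scores for a non-probability sample of 4 units
ps <- c(0.2, 0.25, 0.5, 0.05)
sum(1 / ps)  # estimated population size: 5 + 4 + 2 + 20 = 31
```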
Usage
pop_size(object)
Arguments
object |
object returned by the |
Value
a scalar returning the value of the population size.
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
strata = ~ size + nace + region, data = jvs)
ipw_est1 <- nonprob(selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit"
)
ipw_est2 <- nonprob(
selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit",
control_selection = control_sel(est_method = "gee", gee_h_fun = 1))
## estimated population size based on the non-calibrated IPW (MLE)
pop_size(ipw_est1)
## estimated population size based on the calibrated IPW (GEE)
pop_size(ipw_est2)
Print Method for the Nonprob_summary Object
Description
Print method for the nonprob_summary object, which allows the user to specify
what should and should not be printed.
Usage
## S3 method for class 'nonprob_summary'
print(x, resid = TRUE, pred = TRUE, digits = 4, ...)
Arguments
x |
a |
resid |
whether distribution of residuals should be printed (default is |
pred |
whether distribution of predictions should be printed (default is |
digits |
number of digits to be printed (default 4) |
... |
further parameters passed to the print method (currently not supported) |
Summary Statistics for Models of the Nonprob Class
Description
Summarises the nonprob class object. The summary depends on the type of
estimator (i.e. IPW, MI or DR).
Usage
## S3 method for class 'nonprob'
summary(object, ...)
Arguments
object |
object of the |
... |
Additional optional arguments |
Value
An object of nonprob_summary class containing:
- call – the call
- estimator – type of estimator
- control – list of controls
- ipw_weights – estimated IPW weights
- ipw_weights_total – estimated IPW total (sum)
- ps_scores_nonprob – estimated propensity scores for the non-probability sample
- ps_scores_prob – estimated propensity scores for the probability sample
- case_weights – case weights
- output – estimated means and standard errors
- SE – estimated standard errors of V1 and V2
- confidence_interval – confidence intervals
- nonprob_size – size of the non-probability sample
- prob_size – size of the probability sample
- pop_size – population size
- pop_size_fixed – whether the population size is treated as fixed
- no_prob – whether a probability sample was provided
- outcome – model details
- selection – selection details
- estimator_method – estimator method
- selection_formula – selection formula
- outcome_formula – outcome formula
- vars_selection – whether the variable selection algorithm was applied
- vars_outcome – variables of the outcome models
- ys_rand_pred – predicted values for the random sample (if applicable)
- ys_nons_pred – predicted values for the non-probability sample
- ys_resid – residuals for the non-probability sample
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
strata = ~ size + nace + region, data = jvs)
ipw_est1 <- nonprob(selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit"
)
summary(ipw_est1)
Update Method for the Nonprob Object with Changed Arguments or Parameters
Description
The update method for the nonprob class object, which allows the user to re-estimate
a given model with changed parameters. This is particularly useful if one
would like to change the method, or to estimate standard errors if they were not
estimated in the first place.
Usage
## S3 method for class 'nonprob'
update(object, ..., evaluate = TRUE)
Arguments
object |
the |
... |
arguments passed to the |
evaluate |
If TRUE, evaluate the new call; otherwise return the unevaluated call |
Value
returns a nonprob object
Author(s)
Maciej Beręsewicz
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
strata = ~ size + nace + region, data = jvs)
ipw_est1 <- nonprob(selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit", se = FALSE
)
ipw_est1
update(ipw_est1, se = TRUE)
Extracts the Inverse Probability (Propensity Score) Weights
Description
A method for the generic function weights that returns the inverse probability weights (if present).
Usage
## S3 method for class 'nonprob'
weights(object, ...)
Arguments
object |
a |
... |
other arguments passed to methods (currently not supported) |
Value
A vector of weights, or NULL, extracted from the nonprob object, i.e. the "ipw_weights" element
Examples
data(admin)
data(jvs)
jvs_svy <- svydesign(ids = ~ 1, weights = ~ weight,
strata = ~ size + nace + region, data = jvs)
ipw_est1 <- nonprob(selection = ~ region + private + nace + size,
target = ~ single_shift,
svydesign = jvs_svy,
data = admin, method_selection = "logit", se = FALSE
)
summary(weights(ipw_est1))