We remind that a random variable \(X\) following an admixture distribution has cumulative distribution function (cdf) \(L\) given by \[L(x) = pF(x) + (1-p)G(x), \qquad x \in \mathbb{R},\] where \(G\) is a mixture component whose distribution is perfectly known, whereas \(p\) and \(F\) are unknown. In this setting, if no parametric assumption is made on the unknown component distribution \(F\), the mixture is considered as a semiparametric mixture. For an overview on semiparametric extensions of finite mixture models, see (Xiang and Yang 2018).
The goal of this vignette is to introduce the functionalities that enable to perform hypothesis tests on the unknown component distribution \(F\). We aim to test whether \(F\) belongs to certain parametric family (e.g. the Gaussian one) in a 1-sample case, or if two different decontaminated versions of \(F_1\) and \(F_2\) (obtained from two observed samples \(X_1\) and \(X_2\)) are similar in the K-sample case (\(K \geq 2\)). All the specific tests presented hereafter can be performed using one single generic function for testing with appropriate arguments, the so-called \(admix\_test\) function.
In this setting, the test to be performed is a parametric family testing, i.e. \[H_0: \, F\in \mathcal{F} \qquad \mbox{against} \qquad H_1: \, F\notin \mathcal{F},\] where \(\mathcal{F}=\left\{F_\theta:~\theta\in \Theta \right\}\).
The support of the known component density has to be in line with the one of the unknown component density. Such tests have been introduced in (Pommeret and Vandekerkhove 2019), and the idea underlying this hypothesis test follows these steps:
Because of the use of asymptotically normal estimators, it is not possible to use the estimator provided in (Patra and Sen 2016) to perform hypothesis testing. On the contrary, Bordes and Vandekerkhove (2010) propose an estimator that can be used if the unknown component density is assumed to be symmetric. More generally, Pommeret and Vandekerkhove (2019) give more details about the distribution of the test statistic under the null (hypothesis \(H_0\)), and under the alternative \(H_1\).
Here, the implemented function allows to perform the so-called Gaussianity test, meaning that the parametric family against which the unknown component is tested belongs to Gaussian distributions. Below is an example of hypothesis testing in this 1-sample case:
####### Under the null hypothesis H0.
## Simulate mixture data:
mixt1 <- twoComp_mixt(n = 300, weight = 0.6,
                      comp.dist = list("norm", "norm"),
                      comp.param = list(c("mean" = 2, "sd" = 0.5),
                                        c("mean" = 0, "sd" = 1)))
data1 <- getmixtData(mixt1)
## Define the admixture model:
admixMod <- admix_model(knownComp_dist = mixt1$comp.dist[[2]],
                        knownComp_param = mixt1$comp.param[[2]])
admix_test(samples = list(data1), admixMod = list(admixMod), conf_level = 0.95,
           test_method = "poly", ask_poly_param = FALSE, support = "Real")
#> Call:admix_test(samples = list(data1), admixMod = list(admixMod), 
#>     test_method = "poly", conf_level = 0.95, ask_poly_param = FALSE, 
#>     support = "Real")
#> 
#> Is the null hypothesis H0 rejected? No
#> p-value of the test: 0.682The result of the test is that we cannot reject the null hypothesis \(H_0\), which is in line with the specified distribution for the unknown component. Indeed, simulated data is a Gaussian mixture with two components, i.e. \(F \sim \mathcal{N}(\mu,\sigma)\) where \(\mu=2\) and \(\sigma=0.5\).
Let us introduce two random samples \(X_1\) and \(X_2\) following admixture models, such that \[\begin{align*} \left\{ \begin{array}{l} L_1(x) = (1-p_1)G_1(x) + p_1F_1(x) \\ L_2(x) = (1-p_2)G_2(x) + p_2F_2(x), \end{array} \right. \end{align*}\]
The goal here is to perform the following hypothesis test: \[H_0: ~ F_1=F_2 \qquad \mbox{against} \qquad H_1: F_1\neq F_2.\]
In this framework, we assume that \(F_1\) and \(F_2\) both have a symmetric density. This way the normally-distributed estimator of \(p_1\) and \(p_2\), proposed in (Bordes and Vandekerkhove 2010), can be used together with the testing strategy suggested in (Milhaud et al. 2022). This testing strategy is closely connected to (Pommeret and Vandekerkhove 2019), where the computation of the expansion coefficients is duplicated on each of the two samples under study.
In what follows, we simulate two samples under the null and check whether the test provides satisfactory results.
mixt1 <- twoComp_mixt(n = 600, weight = 0.8,
                      comp.dist = list("norm", "norm"),
                      comp.param = list(list("mean" = 3, "sd" = 0.5),
                                        list("mean" = 0, "sd" = 1)))
mixt2 <- twoComp_mixt(n = 800, weight = 0.7,
                      comp.dist = list("norm", "norm"),
                      comp.param = list(list("mean" = 3, "sd" = 0.5),
                                        list("mean" = 0, "sd" = 1)))
data1 <- getmixtData(mixt1)
data2 <- getmixtData(mixt2)
## Define the admixture models:
admixMod1 <- admix_model(knownComp_dist = mixt1$comp.dist[[2]],
                         knownComp_param = mixt1$comp.param[[2]])
admixMod2 <- admix_model(knownComp_dist = mixt2$comp.dist[[2]],
                         knownComp_param = mixt2$comp.param[[2]])
## Using expansion coefficients in orthonormal polynomial basis:
admix_test(samples = list(data1,data2), admixMod = list(admixMod1,admixMod2), conf_level = 0.95,
           test_method = "poly", ask_poly_param = FALSE, support = "Real")
#> Call:admix_test(samples = list(data1, data2), admixMod = list(admixMod1, 
#>     admixMod2), test_method = "poly", conf_level = 0.95, ask_poly_param = FALSE, 
#>     support = "Real")
#> 
#> Is the null hypothesis H0 rejected? No
#> p-value of the test: 0.096The hypothesis test concludes that the null hypothesis cannot be rejected, once again in line with what was expected given the specified parameters when simulating the data.
Note that the following arguments, involved in subroutines, were set to default values (but that the user can choose them setting parameter ‘ask_poly_param’ to TRUE):
When the unknown component distributions are not supposed to have symmetric densities, another solution to perform the test is to set ‘est_method’ to ‘PS’ following ‘ask_poly_param’ set to TRUE, but keeping in mind that plugging in such estimators for the test procedure should not be allowed in theory. However, that works quite well in practice. Another solution to perform this test in full generality is to use the IBM method (see below).
Estimation of the unknown quantities is made by the Inversion - Best Matching approach, see (Milhaud et al. 2024b). In this case, one can still use the function with same first arguments except ‘method’, and ‘method’ should be set to ‘icv’. The user also has to define the number of simulated Gaussian processes used to tabulate the test statistic distribution (‘n_sim_tab’), and can accelerate computations using parallel computations and choosing an adequate number of cpus. Other arguments such as \(support\) are useless.
We introduce hereafter a natural extension of the two-sample case to the K-sample one, see (Milhaud et al. 2024a). In what follows, the K-sample test is illustrated within the framework of the IBM approach, i.e. using the associated inner convergence property. Of course, in the case when all the unknown component densities are assumed to be symmetric, one could use a pairwise version of the two sample test using the comparison of expansion coefficients in a polynomial orthonormal basis, associated to the estimation method provided by (Bordes and Vandekerkhove 2010).
Consider \(K\) samples. For \(i=1,...,K\), sample \(X^{(i)} = (X_1^{(i)}, ..., X_{n_i}^{(i)})\) follows \[L_i(x) = p_i F_i(x) + (1-p_i) G_i, \qquad x \in \mathbb{R}.\] The test to perform is given by \[H_0 : \; F_1 = ... = F_K \qquad \mbox{against} \qquad H_1: \; F_i \neq F_j \quad \mbox{for some} \quad i \neq j.\] We use the IBM approach to do so, where assumptions are (straightforwardly) adapted to deal with the \(K\) samples.
Basically, we apply the theoretical results of IBM for each pair of populations \((i,j)\), and then build a series of embedded statistics.
Consider the set of pair indices: \({\cal S}(K) = \{(i,j)\in \mathbb{N}^2 ; \; 1\leq i<j \leq K\}\).\ Order \({\cal S}(K)\) lexicographically, and denote \(r_K[(i,j)]\) the rank of \((i,j)\) in the set \(S(K)\).
Then, \(\forall i\neq j \in \{1,...,K\}\),
We then obtain \(d(K) = K(K-1)/2\) comparisons that we embed in a series of statistics: \[\begin{eqnarray*} U_1 &= &T_{1,2} \\ U_2 & = & T_{1,2}+T_{1,3} \\ &\vdots& \\ U_{d(K)} & = &T_{1,2} + \cdots + T_{K-1,K}, \end{eqnarray*}\]
To choose automatically the right order \(k\) for testing, consider the penalization rule (mimicking Schwarz criteria procedure, see (Schwarz 1978)): \[S(n) = \min \left\{\arg\max_{1 \leq k \leq d(K)} \left ({U}_{k} - k \sum_{(i,j)\in S(K)} l_n(i,j) \; 1_{\left\{r_K(i,j)=k\right\}} \right ) \right\}.\]
Our data-driven test statistic is given by \[\tilde{U}_n = U_{S(n)}.\]
It can be shown that under \(H_0\) and appropriate assumptions, \(S(n)\) converges in probablity towards 1 as \(n \rightarrow +\infty\); meaning that we asymptotically choose the first element of \({\cal S}(K)\).\ Moreover, under \(H_0\), \(U_{S(n)}\) converges in law towards \(U^{0}(1,2)\), which is exactly the null limit distribution studied in the two-sample case. Finally, we thus consider the \(H_0\)-rejection rule: \[\tilde{U}_n \quad \geq \quad \hat q_{1-\alpha} \qquad \Rightarrow \qquad H_0 \; \mbox{is rejected}.\]
We now provide the way to perform this test with the package \(admix\) with Gaussian mixtures. First, let us study the case where we are under the null hypothesis \(H_0\), considering \(K=3\) different populations.
mixt1 <- twoComp_mixt(n = 450, weight = 0.4,
                      comp.dist = list("norm", "norm"),
                      comp.param = list(c("mean" = -2, "sd" = 0.5),
                                        c("mean" = 0, "sd" = 1)))
mixt2 <- twoComp_mixt(n = 600, weight = 0.24,
                      comp.dist = list("norm", "norm"),
                      comp.param = list(c("mean" = -2, "sd" = 0.5),
                                        c("mean" = -1, "sd" = 1)))
mixt3 <- twoComp_mixt(n = 400, weight = 0.53,
                      comp.dist = list("norm", "norm"),
                      comp.param = list(c("mean" = -2, "sd" = 0.5),
                                        c("mean" = 2, "sd" = 1)))
data1 <- getmixtData(mixt1)
data2 <- getmixtData(mixt2)
data3 <- getmixtData(mixt3)
admixMod1 <- admix_model(knownComp_dist = mixt1$comp.dist[[2]],
                         knownComp_param = mixt1$comp.param[[2]])
admixMod2 <- admix_model(knownComp_dist = mixt2$comp.dist[[2]],
                         knownComp_param = mixt2$comp.param[[2]])
admixMod3 <- admix_model(knownComp_dist = mixt3$comp.dist[[2]],
                         knownComp_param = mixt3$comp.param[[2]])
admix_test(samples = list(data1, data2, data3), 
           admixMod = list(admixMod1, admixMod2, admixMod3), 
           conf_level = 0.95, test_method = "icv", n_sim_tab = 4, 
           tune_penalty = FALSE, parallel = FALSE, n_cpu = 2)
#> Call:admix_test(samples = list(data1, data2, data3), admixMod = list(admixMod1, 
#>     admixMod2, admixMod3), test_method = "icv", conf_level = 0.95, 
#>     n_sim_tab = 4, tune_penalty = FALSE, parallel = FALSE, n_cpu = 2)
#> 
#> Is the null hypothesis H0 rejected? No
#> p-value of the test: 0.667