Note that the examples are not actually run; their results are stored in the included data set examples.mdgof.vignette. This is done to satisfy CRAN submission rules regarding the execution time of vignettes. If you want to run the Rmarkdown file yourself, set ReRunExamples=TRUE.
This package brings together a number of routines for the goodness-of-fit problem for multivariate data. There is a data set \(\mathbf{x}\) and a cumulative distribution function \(F\) (possibly depending on parameters that have to be estimated from \(\mathbf{x}\)), and we want to test \(H_0:\mathbf{X}\sim F\).
The highlights of this package are:
For descriptions of the included methods see the vignette MDgof-Methods.
We generate a two-dimensional data set of size 100 from a multivariate standard normal distribution and run the test with the null hypothesis \(H_0:X\sim N(\mathbf{0}, \mathbf{I_2})\):
# cumulative distribution function under the null hypothesis
pnull=function(x) {
  f=function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
# function that generates data under the null hypothesis
rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
x=rnull()
examples[["ex1a"]]
#> $statistics
#> qKS qK qCvM qAD BR MMD ES EP
#> 0.09020 0.14800 0.09610 0.56700 3.54100 0.00457 10.30000 12.17000
#>
#> $p.values
#> qKS qK qCvM qAD BR MMD ES EP
#> 0.5650 0.4400 0.6150 0.9050 0.4100 0.6950 0.4148 0.4319
\(B\) is the number of simulation runs used to estimate the p-value. \(B=5000\) is the default, but if the data set is large a smaller value should suffice.
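The stored result above was presumably produced by a call of the following form; this is a sketch patterned after the full gof_test call in example 6, not the exact code of the vignette:
# presumed call; see example 6 for the full argument list
examples[["ex1a"]]=gof_test(x, pnull, rnull, B=B)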
If the density of the distribution under the null hypothesis is also known it can be included and two more tests can be run:
# density under the null hypothesis
dnull=function(x) {
  f=function(x) mvtnorm::dmvnorm(x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
examples[["ex1b"]]
#> $statistics
#> qKS qK qCvM qAD BB BR MMD KSD
#> 0.09020 0.14800 0.09610 0.56700 0.00134 3.64500 0.00790 0.97800
#> ES EP
#> 10.30000 12.17000
#>
#> $p.values
#> qKS qK qCvM qAD BB BR MMD KSD ES EP
#> 0.5500 0.5000 0.5900 0.9250 1.0000 0.4700 0.4500 0.8100 0.4148 0.4319
Arguments of gof_test for continuous data
In this example we test whether the data comes from a bivariate normal distribution, with the means and the variance-covariance matrix estimated from the data.
pnull=function(x, p) {
  f=function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)),
                x, mean=p[1:2],
                sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function(p) mvtnorm::rmvnorm(200,
                mean=p[1:2],
                sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
dnull=function(x, p) {
  f=function(x) mvtnorm::dmvnorm(x,
                mean=p[1:2],
                sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
phat=function(x)
  c(apply(x, 2, mean), apply(x, 2, var), cov(x[,1], x[,2]))
x=rnull(c(1, 2, 4, 9,0.6))
phat(x)
#> [1] 1.2812099 1.8038581 3.9441709 8.0894505 0.1389893
examples[["ex2a"]]
#> $statistics
#> qKS qK qCvM qAD BB BR MMD KSD ES EP
#> 0.08060 0.14100 0.05590 0.48000 0.00607 4.56200 0.00121 0.15500 3.96100 8.39000
#>
#> $p.values
#> qKS qK qCvM qAD BB BR MMD KSD ES EP
#> 0.3100 0.2850 0.5900 0.8700 0.5450 0.7950 0.9700 0.4700 0.6819 0.2995
In contrast, in the next case we generate data from a multivariate t distribution with 5 degrees of freedom:
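The vignette's generator for this alternative is not shown here; a minimal sketch, assuming the same parameter layout as rnull above and using mvtnorm::rmvt (the name ralt_t is introduced here for illustration):
# sketch of a generator for the t alternative with 5 df
ralt_t=function(p) mvtnorm::rmvt(200, sigma=matrix(c(p[3], p[5], p[5], p[4]), 2, 2),
                df=5, delta=p[1:2])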
gof_test allows the user to supply their own test method, either one that finds its own p-value or a method that requires simulation.
As an example, the routine newTS (included in the package) finds either the test statistic or the p-value of a chi-square test in two dimensions, for either continuous or discrete data.
In this example we use the same setup as in example 2. newTS requires the following object TSextra (a list):
examples[["ex3a"]]
#> $statistics
#> ES33 EP33
#> 0.100 0.306
#>
#> $p.values
#> ES33 EP33
#> 0.49 0.33
examples[["ex3b"]]
#> $statistics
#> ES33 EP33
#> 0.000000 0.000118
#>
#> $p.values
#> ES33 EP33
#> 0.975 1.000
Arguments and output of new test routine for continuous data
The arguments have to be:
The output of the routine has to be a named vector. If the routine is written in Rcpp, parallel programming is not available.
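A minimal skeleton of such a routine; the argument pair mirrors the TS and TSextra arguments passed to run.studies later in this vignette, and both the argument list and the statistic are assumptions for illustration only:
# hypothetical skeleton of a user-supplied test statistic routine
myTS=function(x, TSextra) {
  stat=abs(mean(x[,1])-mean(x[,2]))  # placeholder statistic
  c(mystat=stat)  # the output has to be a named vector
}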
If several tests are run, one can use the routine gof_test_adjusted_pvalue to find a p-value that is adjusted for simultaneous testing:
a=examples[["ex3c"]]
for(i in seq_along(a))
  print(paste(names(a)[i],": ", round(a[i],4)))
#> [1] "qKS : 0.8502"
#> [1] "qK : 0.7295"
#> [1] "qCvM : 0.8744"
#> [1] "qAD : 0.8261"
#> [1] "BB : 0.7101"
#> [1] "BR : 0.0193"
#> [1] "MMD : 0.628"
#> [1] "KSD : 0.5411"
#> [1] "ES : 0.7158"
#> [1] "EP : 0.7212"
#> [1] "Min p : 0.1871"The BR test had the smallest p-value (\(0.0193\)), which is adjusted to \(0.1817\) to account for the fact that \(10\) tests where run simultaneously.
The routine gof_power allows the user to estimate the power of the tests.
Let’s say we want to estimate the power when the null hypothesis specifies a standard bivariate normal distribution, but the data actually comes from a bivariate normal distribution with mean vector \((0,0)\), variances equal to 1 and a covariance \(\rho\). We first need a function that generates these data sets:
pnull=function(x) {
  f=function(x) mvtnorm::pmvnorm(rep(-Inf, length(x)), x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function() mvtnorm::rmvnorm(100, c(0, 0))
dnull=function(x) {
  f=function(x) mvtnorm::dmvnorm(x)
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
ralt=function(s) mvtnorm::rmvnorm(100, c(0, 0),
                sigma=matrix(c(1, s, s, 1), 2, 2))
Now we can run:
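The call below is a sketch of what was run; gof_power's argument names are assumptions here, with param_alt matching the parameter values 0 and 0.8 shown in the output (dnull was presumably also passed, since the BB and KSD tests appear in the results):
# presumed call; argument names are assumptions
examples[["ex4a"]]=gof_power(pnull, rnull, ralt, param_alt=c(0, 0.8), B=B)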
examples[["ex4a"]]
#> qKS qK qCvM qAD BB BR MMD KSD ES EP
#> 0 0.080 0.045 0.035 0.065 0.075 0.035 0.045 0.105 0.06 0.075
#> 0.8 0.775 0.460 0.900 0.950 0.975 0.585 0.925 1.000 1.00 1.000
Arguments of gof_power
Again the user can provide their own routine:
First note that tests for discrete data are implemented only in two dimensions. The data set is a matrix with three columns. Each row has a specific value of the x variable, a value of the y variable and the corresponding count.
We consider the case of data from binomial distributions:
pnull=function(x) {
  f=function(x)
    sum(dbinom(0:x[1], 5, 0.5)*pbinom(x[2], 3, 0.5+0:x[1]/20))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function() {
  x=rbinom(1000, 5, 0.5)
  y=rbinom(1000, 3, 0.5+x/20)
  MDgof::sq2rec(table(x, y))
}
x=rnull()
x
#> [,1] [,2] [,3]
#> [1,] 0 0 0
#> [2,] 1 0 15
#> [3,] 2 0 23
#> [4,] 3 0 11
#> [5,] 4 0 3
#> [6,] 5 0 0
#> [7,] 0 1 10
#> [8,] 1 1 52
#> [9,] 2 1 95
#> [10,] 3 1 76
#> [11,] 4 1 30
#> [12,] 5 1 4
#> [13,] 0 2 16
#> [14,] 1 2 63
#> [15,] 2 2 122
#> [16,] 3 2 134
#> [17,] 4 2 69
#> [18,] 5 2 7
#> [19,] 0 3 0
#> [20,] 1 3 28
#> [21,] 2 3 69
#> [22,] 3 3 90
#> [23,] 4 3 68
#> [24,] 5 3 15
examples[["ex5"]]
#> $statistics
#> qKS qK qCvM qAD ChiSquare
#> 0.02820 0.03980 0.00237 0.01290 13.06000
#>
#> $p.values
#> qKS qK qCvM qAD ChiSquare
#> 0.2050 0.1750 0.4000 0.7350 0.8355
The MDgof routine sq2rec above changes the format of the data from the one output by the table command to the one required by gof_test.
The arguments of gof_test for discrete data are the same as those for continuous data. The routines determine that the data is discrete if x is a matrix with three columns and the values in the third column are integers.
One use of the discrete data methods is when the data is actually continuous but the data set is very large, so that the continuous methods take too long to run. In that case one can discretize the data and run a discrete method. In this example we have one million observations from a bivariate random variable and we want to test whether they come from dependent Beta distributions, with a parameter to be estimated from the data. The null hypothesis specifies the model \(X\sim Beta(a, a)\), \(Y|X=x\sim Beta(x, x)\), with the parameter \(a\). Here is an example of data drawn from this model, using \(a=2\):
rnull_cont=function(p) {
  x=rbeta(250, p, p)
  y=rbeta(250, x, x)
  cbind(x=x, y=y)
}
dta_cont=rnull_cont(2)
ggplot2::ggplot(data=as.data.frame(dta_cont), ggplot2::aes(x,y)) +
  ggplot2::geom_point()
Here we have the joint density
\[ \begin{aligned} &f(x, y) = f_X(x)f_{Y|X=x}(y|x) = \\ &\frac{\Gamma(2a)}{\Gamma(a)^2}[x(1-x)]^{a-1} \frac{\Gamma(2x)}{\Gamma(x)^2}[y(1-y)]^{x-1} \end{aligned} \]
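The stored example below runs without a density (the gof_test call for this example passes no dnull), but the density above translates directly into R should one want the tests that need it; dnull_cont is a name introduced here for illustration:
# sketch: joint density f(x,y) = f_X(x) f_{Y|X=x}(y|x) via dbeta
dnull_cont=function(x, p) {
  f=function(x) dbeta(x[1], p, p)*dbeta(x[2], x[1], x[1])
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}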
First we need an estimator of \(a\). We can find the maximum likelihood estimator using the Newton-Raphson algorithm:
\[ \begin{aligned} &l(a) = n\left[\log \Gamma(2a)-2\log \Gamma(a)+\frac{a-1}{n}\sum\log [x_i(1-x_i)]\right]\\ &\frac{dl}{da}=2n\left[di(2a)-di(a)+\frac{1}{2n}\sum\log [x_i(1-x_i)]\right]\\ &\frac{d^2l}{da^2}=2n\left[2\,tri(2a)-tri(a)\right]\\ \end{aligned} \] where \(di\) and \(tri\) are the first and second derivatives of the log-gamma function (the digamma and trigamma functions). Here is an implementation:
phat_cont=function(x) {
  n=nrow(x)
  sx=sum( log(x[,1]*(1-x[,1])) )/2/n
  num=function(a) digamma(2*a)-digamma(a)+sx     # dl/da, scaled by 1/(2n)
  denom=function(a) 2*trigamma(2*a)-trigamma(a)  # d2l/da2, scaled by 1/(2n)
  anew=2  # starting value
  repeat {
    aold=anew
    anew=aold-num(aold)/denom(aold)  # Newton-Raphson step
    if(abs(aold-anew)<0.001) break
  }
  anew
}
phat_cont(dta_cont)
#> [1] 2.063892
For the function pnull we make use of the following general calculation:
\[ \begin{aligned} &F(x,y) = P(X\le x;Y\le y) =\\ &\int_{-\infty}^x \int_{-\infty}^y f(u,v)\,dv\,du = \\ &\int_{-\infty}^x \int_{-\infty}^y f_X(u)f_{Y|X=u}(v|u)\,dv\,du = \\ &\int_{-\infty}^x f_X(u) \left\{\int_{-\infty}^y f_{Y|X=u}(v|u)\,dv\right\}du = \\ &\int_{-\infty}^x f_X(u)F_{Y|X=u}(y|u)\,du \end{aligned} \]
which is implemented in
pnull=function(x, p) {
  f=function(x) {
    g=function(u) dbeta(u, p, p)*pbeta(x[2], u, u)
    integrate(g, 0, x[1])$value
  }
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
and with this we can run the test
examples[["ex6a"]]=gof_test(dta_cont, pnull, rnull_cont,
phat=phat_cont,
Ranges=matrix(c(0, 1,0, 1), 2, 2),
maxProcessor = 1, B=B)examples[["ex6a"]]
#> $statistics
#> qKS qK qCvM qAD BR MMD ES EP
#> 0.06720 0.11100 0.20100 1.57400 1.82900 0.00286 28.19000 30.08000
#>
#> $p.values
#> qKS qK qCvM qAD BR MMD ES EP
#> 0.3000 0.1200 0.1600 0.1400 0.6600 0.4000 0.1349 0.0904
Now, though, let’s assume that the data set has one million observations. In that case the routine above will be much too slow. Instead we can discretize the problem. In order to make this work we have to discretize all the routines except for pnull, as the distribution function is the same for discrete and continuous random variables.
Let’s say we decide to discretize the data into a 50x30 grid of equal-size bins. As in example 5 we can generate the data using the binomial distribution. However, we also have to find the bin probabilities, essentially the density of the discretized random vector. This is done in
rnull_disc=function(p) {
  pnull=function(x, p) {
    f=function(x) {
      g=function(u) dbeta(u, p, p)*pbeta(x[2], u, u)
      integrate(g, 0, x[1])$value
    }
    if(!is.matrix(x)) return(f(x))
    apply(x, 1, f)
  }
  nb=c(50, 30)
  xvals=1:nb[1]/nb[1]  # right endpoints of the x bins
  yvals=1:nb[2]/nb[2]  # right endpoints of the y bins
  z=cbind(rep(xvals, nb[2]), rep(yvals, each=nb[1]), 0)
  prob=c(t(MDgof::p2dC(z, pnull, p)))  # bin probabilities
  z[, 3]=rbinom(prod(nb), 1e6, prob)   # simulated bin counts
  z
}
dta_disc=rnull_disc(2)
dta_disc[1:10, ]
#> [,1] [,2] [,3]
#> [1,] 0.02 0.03333333 20196
#> [2,] 0.04 0.03333333 33428
#> [3,] 0.06 0.03333333 0
#> [4,] 0.08 0.03333333 527
#> [5,] 0.10 0.03333333 592
#> [6,] 0.12 0.03333333 20160
#> [7,] 0.14 0.03333333 66612
#> [8,] 0.16 0.03333333 0
#> [9,] 0.18 0.03333333 566
#> [10,] 0.20 0.03333333 6
The MDgof routine p2dC finds the probabilities of the rectangles defined by the grid from the distribution function.
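For a single rectangle this is the usual inclusion-exclusion identity \(P(a<X\le b, c<Y\le d)=F(b,d)-F(a,d)-F(b,c)+F(a,c)\). A sketch of that computation (rect_prob is a name introduced here, not part of MDgof):
# illustrative rectangle probability from a cdf(x, p) function
rect_prob=function(cdf, lo, hi, p)
  cdf(c(hi[1], hi[2]), p) - cdf(c(lo[1], hi[2]), p) -
    cdf(c(hi[1], lo[2]), p) + cdf(c(lo[1], lo[2]), p)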
Next we need to rewrite the routine that finds the estimator, now based on the discrete data:
phat_disc=function(x) {
  nb=c(50, 30)
  n=sum(x[,3])  # sample size
  valsx=unique(x[,1])-1/nb[1]/2  # x values, centers of the bins
  x=tapply(x[,3], x[,1], sum)  # counts for x alone
  sx=sum(x*log(valsx*(1-valsx)))/2/n
  num=function(a) digamma(2*a)-digamma(a)+sx
  denom=function(a) 2*trigamma(2*a)-trigamma(a)
  anew=2
  repeat {
    aold=anew
    anew=aold-num(aold)/denom(aold)
    if(abs(aold-anew)<0.00001) break
  }
  anew
}
p=phat_disc(dta_disc)
p
#> [1] 1.083052
We want to apply this test to the original continuous data set, so we need to discretize it as well. This can be done with the MDgof routine discretize. Then we can run the test:
set.seed(111)
x=rbeta(1e6, 2, 2)
y=rbeta(1e6, x, x)
dta_cont=cbind(x=x, y=y)
dta_disc=MDgof::discretize(dta_cont,
Range=matrix(c(0,1,0,1),2,2),
nbins=c(50, 30))
head(dta_disc)
#> [,1] [,2] [,3]
#> [1,] 0.02 0.03333333 613
#> [2,] 0.04 0.03333333 1555
#> [3,] 0.06 0.03333333 2396
#> [4,] 0.08 0.03333333 3144
#> [5,] 0.10 0.03333333 3690
#> [6,] 0.12 0.03333333 4156
The routine newTS (included in the package) does a chi-square test for discrete data:
Power estimation for discrete data is done the same way as for continuous data:
We consider again the case of data from binomial distributions as in example 5. As an alternative we change the success probability of \(X\).
pnull=function(x) {
  f=function(x)
    sum(dbinom(0:x[1], 5, 0.5)*pbinom(x[2], 3, 0.5+0:x[1]/20))
  if(!is.matrix(x)) return(f(x))
  apply(x, 1, f)
}
rnull=function() {
  x=rbinom(1000, 5, 0.5)
  y=rbinom(1000, 3, 0.5+x/20)
  MDgof::sq2rec(table(x, y))
}
ralt=function(p) {
  x=rbinom(1000, 5, p)  # success probability p instead of 0.5
  y=rbinom(1000, 3, 0.5+x/20)
  MDgof::sq2rec(table(x, y))
}
x=ralt(0.475)
examples[["ex7a"]]
#> $statistics
#> qKS qK qCvM qAD ChiSquare
#> 0.0480 0.0532 0.0107 0.0843 29.4600
#>
#> $p.values
#> qKS qK qCvM qAD ChiSquare
#> 0.0000 0.0200 0.0050 0.0000 0.0591
examples[["ex7b"]]
#> qKS qK qCvM qAD Chisquare
#> 0.5 0.065 0.075 0.060 0.050 0.06
#> 0.475 0.845 0.620 0.795 0.715 0.48
or using a user-supplied test:
The routine run.studies allows the user to quickly compare the power of a new test with the power of the included tests for 30 different combinations of null hypothesis vs alternative, for continuous or discrete data and with or without parameter estimation. It also allows the user to find the power for those case studies for different sample sizes and type I error probabilities.
Let’s say that we want to determine whether method A or method B has better power for a specific combination of null hypothesis and alternative. If we find the powers of the tests for (say) \(n=100,\alpha=0.05\) and method A is better, A will also be better for any other combination of \(n\) and \(\alpha\). For this reason checking just one combination suffices to determine the ranking of the methods (for this one case).
Say we want to compare the power of newTS for continuous data with parameter estimation with the powers of the included tests:
a=run.studies(study=1:2, # do first two
       Continuous=TRUE,
       WithEstimation = TRUE,
       TS=newTS,
       TSextra=TSextra,
       With.p.value=TRUE,
       B=B)
Arguments of run.studies
MDgof includes five data sets with the results for the included tests using a sample size of 250, a true type I error probability of 0.05 and two values of the parameter under the alternative, one which makes the null hypothesis true and one which makes it false. They are
If the user wants different settings, they can run:
examples[["ex8"]]= run.studies(1:2,
Continuous=TRUE, WithEstimation = FALSE,
param_alt=cbind(c(0, 0.4), c(0, 0.4)),
alpha=0.1, nsample=500, B=B)examples[["ex8"]][["Null"]][1:2, ]
#> qKS qK qCvM qAD BB BR MMD KSD ES EP
#> uniform.diagonal.n 0.082 0.114 0.133 0.116 0.097 0.114 0.121 0.099 0.125 0.13
#> uniform.diagonal.b 0.949 0.978 0.990 1.000 0.942 0.998 0.986 0.961 1.000 1.00
examples[["ex8"]][["Pow"]][1:2, ]
#> qKS qK qCvM qAD BB BR MMD KSD ES EP
#> uniform.diagonal.n 0.082 0.114 0.133 0.116 0.097 0.114 0.121 0.099 0.07 0.085
#> uniform.diagonal.b 0.949 0.978 0.990 1.000 0.942 0.998 0.986 0.961 1.00 1.000