PUlasso is an algorithm for parameter estimation and classification using Positive and Unlabelled(PU) data. More concretely, presented with two sets of sample such that the first set consisting of \(n_l\) positive and labelled observations and a second set containing \(n_u\) observations randomly drawn from the population with only covariates not the responses being observed and the true positive prevalence information \(P(Y=1)\), PUlasso algorithm models the relationship between the probability of a response being positive and \((x, \theta)\) using the standard logistic regression model:\(P_\theta(y=1|x) := 1/(1+exp(-\theta^Tx))\) and solves the following optimization problem: \[\begin{equation}\label{objFunc} \underset{\theta}{\operatorname{argmin}} -\log L(\theta;x,z) + P_\lambda(\theta) \end{equation}\] where \(\log L(\theta;x,z)\) is an observed log-likelihood based on covariates and labels \((x,z)\), and \(P_\lambda(\theta)\) is \(\ell_1\) or \(\ell_1/\ell_2\) penalty. For more detailed discussion, please see our paper https://arxiv.org/abs/1711.08129
We demonstrate basic usage of PUlasso package using simulated PU data.
This loads simulPU object which contains the input matrix \(X\), labels \(z\), true (latent) responses \(y\), and the positive prevalence true\(PY1\). We first visualize the data. We plot
the first two columns \(X1-X2\) colored
by \(z\) or \(y\) as the first two variables are set to
be active in the simulation. From the following plots, we see that many
positive samples are marked as unlabelled. For more details about the
simulation setting, do help(simulPU)
We fit the model using the most basic call. By default it fits the model for 100 values of \(\lambda\) starting from the null model with lasso penalty.
(fit=grpPUlasso(X=simulPU$X,z=simulPU$z,py1=simulPU$truePY1))
#>
#> Call: grpPUlasso(X = simulPU$X, z = simulPU$z, py1 = simulPU$truePY1)
#>
#> Optimization Method: "CD"
#>
#> Max nIters: 1000
coefficients can be extracted using coef
function. Here
we extract estimated coefficients for 30th \(\lambda\). By default, coefficients are
returned in an original scale. If desired, we can set
std.scale=T
to obtain coefficients in the standardizeds
scale.
coef(fit, lambda=fit$lambda[30])
#> [,1]
#> (Intercept) 0.05544055
#> X1 0.30819669
#> X2 0.36584501
#> X3 0.00000000
#> X4 0.00000000
#> X5 0.00000000
If we want to predict responses at certain \(x\), we use the predict
function. By default, it returns estimated probabilities.
xnew = matrix(rnorm(10),2,5)
predict(fit,newdata = xnew,lambda = fit$lambda[30])
#> [,1]
#> [1,] 0.5517827
#> [2,] 0.3281259
It is a common practice to choose \(\lambda\) based on cross-validation. Main
function for the k-fold cross-validation is
cv.grpPUlasso
.
(cv.fit = cv.grpPUlasso(X=simulPU$X,z=simulPU$z,py1=simulPU$truePY1))
#>
#> Call: cv.grpPUlasso(X = simulPU$X, z = simulPU$z, py1 = simulPU$truePY1)
#>
#> Optimization Method: "CD"
#>
#> Max nIters: 1000
We use deviance for a measure of model fit. Average deviance and
standard error of deviance over all \(k\)-folds for all \(\lambda\) values saved in
cv.fit$cvm
and cv.fit$cvsd
, respectively. We
are particularly interested in two lambda values : lambda.min which
gives the minimum mean cross-validation deviance, and lambda.1se which
corresponds to the largest \(\lambda\)
such that cvm is within one standard error of the minimum. We can also
extract coefficients corresponding to such lambda levels.
coef(cv.fit,lambda=cv.fit$lambda.1se)
#> [,1]
#> (Intercept) 0.24831182
#> X1 0.39536496
#> X2 0.48877381
#> X3 0.00000000
#> X4 0.07366768
#> X5 0.00000000
We finalize this section by demonstrating how to do a classification based on fitted models. A natural threshold of .5 is applied for a classification. We plot \(X1-X2\) colored by \(\hat{y}\) to check classification performances.
phat<-predict(cv.fit,newdata = simulPU$X,lambda = cv.fit$lambda.1se,type = "response")
yhat<-1*(phat>0.5)
We can also use a group lasso penalty. Suppose \(X1\) is in group 1, \(X2-X3\) are in group 2, and \(X4-X5\) are in group 3. We only need to
provide a membership information using an additional vector, here named
as a grpvec
.
grpvec = c(1,2,2,3,3)
fit.grp = grpPUlasso(X=simulPU$X,z=simulPU$z,py1=simulPU$truePY1,group = grpvec)
All members in the group are either all included or not included. For example, we see from 13th \(\lambda\) to 14th \(\lambda\), members in group 2 are entered the model together.
coef(fit.grp,fit.grp$lambda[12:15])
#> [,1] [,2] [,3] [,4]
#> (Intercept) -0.02699613 -0.02687234 -0.031632001 -0.036079220
#> X1 0.32136211 0.33761227 0.341612439 0.342443679
#> X2 0.00000000 0.00000000 0.018538324 0.039900657
#> X3 0.00000000 0.00000000 -0.001038037 -0.002210257
#> X4 0.00000000 0.00000000 0.000000000 0.000000000
#> X5 0.00000000 0.00000000 0.000000000 0.000000000
PUlasso can exploit a sparsity in an input matrix for a more
efficient calculation. If dgCMatrix
type of the input
matrix is provided, PUlasso automatically performs a sparse
calculation.
For a simple demonstration, here we create dgCMatrix
objects based on \(X\). First we create
a sparse matrix sparseX by imposing 0 on 95% of the entries of \(X\).
sparseX <- simulPU$X
sparseX[sample(1:length(simulPU$X),size = length(simulPU$X)*0.95)]<-0
sparseX<-Matrix(sparseX)
class(sparseX)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
Those input matrices can be used in the same way. For example,
(spfit<-grpPUlasso(sparseX,simulPU$z,simulPU$truePY1))
#>
#> Call: grpPUlasso(X = sparseX, z = simulPU$z, py1 = simulPU$truePY1)
#>
#> Optimization Method: "CD"
#>
#> Max nIters: 1000
newx = matrix(rnorm(10),2,5)
predict(spfit,newdata = newx,lambda = spfit$lambda[10])
#> [,1]
#> [1,] 0.4705057
#> [2,] 0.5020064
By default, PUlasso uses a block-coordinate descent algorithm to solve optimization problem \(\ref{objFunc}\). However, it provides other optimization options such as proximal full gradient descent(GD), stochastic gradient descent(SGD), and variants of SGD including stochastic variance-reduced gradient(SVRG) and stochastic average gradient(SAG). For example, using SVRG method, we fit,
(fit.SVRG = grpPUlasso(X=simulPU$X,z=simulPU$z,py1=simulPU$truePY1,method="SVRG",eps = 1e-6,lambda =fit$lambda[2]))
#>
#> Call: grpPUlasso(X = simulPU$X, z = simulPU$z, py1 = simulPU$truePY1, lambda = fit$lambda[2], eps = 1e-06, method = "SVRG")
#>
#> Optimization Method: "SVRG"
#>
#> Max nIters: 20000
By default, PUlasso does 10*number of observations iterations. The algorithm terminates early if the difference of current parameter value from the previous one is less than the eps.