
lpda: Linear Programming Discriminant Analysis

Carmen Gandía, Department of Mathematics, Alicante University, Spain

Maria J. Nueda, Department of Mathematics, Alicante University, Spain

11 March 2025

Index

  1. The method
  2. The package
  3. The data
  4. lpda function
  5. lpdaCV function
  6. lpda.3D function
  7. lpdaCV.3D function
  8. References

1. The method

lpda is an R package that addresses the classification problem through linear programming. The method looks for a hyperplane, H, that separates the samples into two groups by minimizing the sum of the distances of all samples to the region (half-space) assigned to the group each individual belongs to. This results in a convex optimization problem for which we find an equivalent linear programming problem. We demonstrated that H exists when the centroids of the two groups are not equal [1]. The method has been extended to more than two groups by considering pairwise comparisons. Moreover, lpda offers the possibility of working with Principal Components (PCs) to reduce the dimension of the data, avoiding overfitting problems. This option can be applied regardless of the number of samples, n, and variables, p, that is, whether n > p or n < p. Compared with other similar techniques it is very fast, mainly because it is based on a linear programming problem [2].
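
As a rough schematic (not the exact formulation in [1], which also includes constraints ruling out the trivial solution $w = 0$), the two-group problem for a hyperplane $H:\ w^{\top}x = b$ can be written as

$$\min_{w,\,b,\,e}\ \sum_i e_i \quad \text{subject to} \quad w^{\top}x_i - b \ge -e_i \ (i \in G_1), \qquad w^{\top}x_i - b \le e_i \ (i \in G_2), \qquad e_i \ge 0,$$

where $e_i$ is the deviation of sample $x_i$ from the half-space assigned to its group, so that samples lying on the correct side of H contribute zero to the objective.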

Recently, the method has been adapted to 3-dimensional data [4].

2. The package

library(lpda) 

The main function is lpda, which collects the input data and, if required, standardizes it or applies Principal Component Analysis (PCA) through lpda.pca. It then calls lpda.fit once for each pairwise comparison. The result is an object of class lpda, which is the input for computing predictions with the predict function and for visualizing results with the plot function.

The package also includes a function named lpdaCV to estimate, by cross-validation (CV), the classification error on different test sets. This is helpful for deciding an appropriate number of PCs or a specific strategy to compute the hyperplane.

lpda.3D and lpdaCV.3D are designed for dealing with 3-dimensional data.

3. The data

The lpda package includes two example data sets: palmdates and RNAseq. The first is a real data set from a chemometrics study and the second a simulated RNA-seq experiment. In this document we show the performance of the package with these data sets and with the iris data available in base R.

Palmdates data

Palmdates is a data set with measurements of 21 palm dates, including their respective Raman spectra and the concentration of five compounds covering a wide range of concentrations: fibre, glucose, fructose, sorbitol and myo-inositol [3]. The first 11 dates are Spanish (from Elche, Alicante), with no well-defined variety, and the last 10 are from other countries and varieties, mainly Arabian. The data set contains two data.frames: conc, with 5 variables, and spectra, with 2050.

data("palmdates")
names(palmdates)
#> [1] "conc"    "spectra"
dim(palmdates$spectra)
#> [1]   21 2050
names(palmdates$conc)
#> [1] "fibre"        "sorbitol"     "fructose"     "glucose"      "myo-inositol"

As the conc and spectra variables are highly correlated, applying the method with PCs substantially reduces the dimension of the data.
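
As a quick optional check with base R (not part of lpda), we can look at how much of the variability in the spectra is captured by the first principal components:

# Optional check: cumulative proportion of variance explained by the first PCs
pc.spectra = prcomp(palmdates$spectra)
summary(pc.spectra)$importance["Cumulative Proportion", 1:3]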

RNAseq data

This data set has been simulated as Negative Binomial distributed and transformed to RPKM (reads per kilobase per million mapped reads). It is a 3-dimensional array containing the expression of 600 genes measured on 20 samples at 4 time-points. The first 10 samples belong to the first group and the remaining 10 to the second. It has been simulated with few variables (genes) that discriminate between groups; there is little correlation and a lot of noise.

data("RNAseq")
dim(RNAseq)
#> [1]  20 600   4
head(RNAseq[,1:6,1:2])
#> , , 1
#> 
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]   70  129   81   86  169  179
#> [2,]   12  153   80  155   85   95
#> [3,]  217   68  136  112  179   96
#> [4,]  109   89   73  194  158  148
#> [5,]   58   92   95   40   59   55
#> [6,]  102   41  188  121   99   95
#> 
#> , , 2
#> 
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]   65   94   95  195  164   95
#> [2,]  101  117  119   92   59  116
#> [3,]   55   67   64   24   54  104
#> [4,]   48   92   68   95   74   86
#> [5,]  221   65   85   81   52   45
#> [6,]   45   13   97   70   76  101

4. lpda function

Palmdates: Chemical variables

First, to show the performance of the package, we apply the method with the first two variables, fibre and sorbitol. Using only two variables allows the hyperplane to be visualized in two dimensions; in this case, it is a straight line.

   group = as.factor( c(rep("Spanish",11), rep("Other",10)) )
   model1 = lpda(data = palmdates$conc[,1:2], group = group )
   model1
#> Call:
#> lpda(data = palmdates$conc[, 1:2], group = group)
#> Coefficients: 
#> [1] -1.116131 -3.002923 -6.563646
   names(model1)
#> [1] "coef"  "data"  "group" "scale" "pca"   "call"

One of the outputs of lpda is a matrix with the coefficients of the hyperplane ax = b for each pairwise comparison. In this example, -1.116 and -3.003 are the coefficients of fibre and sorbitol, respectively, and -6.564 is the constant b.
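
As an illustration (the mapping from the sign to the group labels is handled internally by the predict function), we can evaluate ax - b for one sample and see on which side of H it lies:

   # Illustration only: the first two entries of model1$coef are the
   # coefficients of fibre and sorbitol (a) and the last one is the constant (b)
   a = model1$coef[1:2]
   b = model1$coef[3]
   x = as.numeric(palmdates$conc[1, 1:2])  # first sample (elche1)
   sum(a * x) - b                          # the sign gives the side of H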

In cases with 2 variables, as in this simple example, we can plot the separating line over the points, which represent the samples, with the following code:

   plot(palmdates$conc[,1:2], col = as.numeric(group)+1, pch = 20, 
   main = "Model 1. Palmdates: fibre & sorbitol")
   abline(model1$coef[3]/model1$coef[2], -model1$coef[1]/model1$coef[2], cex = 2)
   legend("bottomright", c("Other","Spanish"),col = c(2,3), pch = 20, cex=0.8)

The predicted group for each sample is obtained with the predict function, and the confusion matrix with summary. We observe that all the samples are correctly classified.

predict(model1)
#> 
#>              elche1              elche2              elche3              elche4 
#>           "Spanish"           "Spanish"           "Spanish"           "Spanish" 
#>              elche5              elche6              elche7              elche8 
#>           "Spanish"           "Spanish"           "Spanish"           "Spanish" 
#>              elche9             elche10             elche11      salomon.israel 
#>           "Spanish"           "Spanish"           "Spanish"             "Other" 
#>       hayani.israel      medjool.israel        barhi.israel   deglet.noor.tunez 
#>             "Other"             "Other"             "Other"             "Other" 
#>        tunez.alligh argelia.deglet.noor                iran  arabia.saudi.perny 
#>             "Other"             "Other"             "Other"             "Other" 
#>  sud.africa.medjool 
#>             "Other"
summary(model1)
#> Call:
#> lpda(data = palmdates$conc[, 1:2], group = group)
#> 
#> CONFUSION MATRIX
#>          Real
#> Predicted Other Spanish
#>   Other      10       0
#>   Spanish     0      11

When the five concentration variables are considered, all samples are correctly classified as well. Since a 5-dimensional plot cannot be shown, we offer a plot that displays the position of the samples with respect to H: the Y-axis represents the order in which the samples appear in the data matrix and the X-axis the distance of each sample to H.

   model2 = lpda(data = palmdates$conc, group = group )
   plot(model2, main="Model 2. Palmdates: all conc variables")

Another option is to apply lpda to the scores of the first PCs. When the data are highly correlated, two components are usually enough. In that case, setting the plot argument PCscores = TRUE displays the scores on the first two PCs, indicating on each axis the proportion of variance explained by that PC. Moreover, if there are two groups, as in this example, the optimal hyperplane is also shown.

model3 = lpda(data = palmdates$conc, group = group, pca = TRUE, Variability = 0.7)
plot(model3, PCscores = TRUE, main = "Model 3. Palmdates: PCA on conc variables")

Palmdates: Spectra variables

For data sets with more variables than individuals, the usual computation of the PCA is not directly applicable. In lpda we handle this situation by working with the equivalences between the PCA of the data matrix and that of its transposed matrix.
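
The idea behind this equivalence can be sketched with base R (an illustration of the underlying algebra only, not the code used inside lpda): for a centered n x p matrix X with n much smaller than p, the PC scores can be obtained from the small n x n matrix XX' instead of the p x p covariance matrix.

# Sketch of the equivalence (illustration only)
X = scale(palmdates$spectra, center = TRUE, scale = FALSE)  # centered 21 x 2050 matrix
eig = eigen(X %*% t(X))                                     # small 21 x 21 eigenproblem
scores = eig$vectors %*% diag(sqrt(pmax(eig$values, 0)))    # PC scores (U D in the SVD X = U D V')
# Up to column signs, these coincide with prcomp(palmdates$spectra)$x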

In palmdates$spectra there are 2050 highly correlated measurements, as shown in the following figure:

Due to this high correlation, the application of lpda to all the spectra variables and to the first 2 PCs (model4) gives the same solution: 0 prediction errors.

model4 = lpda(data = palmdates$spectra, group = group, pca = TRUE, Variability = 0.9)
plot(model4, PCscores = TRUE, main = "Model 4. Palmdates: PCA on Spectra variables")

The following example shows how to predict the group of a test set that did not participate in fitting the model. In this set we include two samples from each group.

test = c(10,11,12,13)
model5 = lpda(data = palmdates$spectra[-test,], group = group[-test], pca = TRUE,
              Variability = 0.9)
predict(model5,  datatest=palmdates$spectra[test,])
#> 
#>        elche10        elche11 salomon.israel  hayani.israel 
#>      "Spanish"      "Spanish"        "Other"        "Other"
summary(model5, datatest=palmdates$spectra[test,], grouptest = group[test])
#> Call:
#> lpda(data = palmdates$spectra[-test, ], group = group[-test], 
#>     pca = TRUE, Variability = 0.9)
#> 
#> CONFUSION MATRIX
#>          Real
#> Predicted Other Spanish
#>   Other       2       0
#>   Spanish     0       2

iris data

To show an example with more than two groups we use the famous (Fisher’s or Anderson’s) iris data set that is available in base R. The iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

The results with the four variables give 2 classification errors. In this case the model computes 3 hyperplanes, one for each of the 3 pairwise comparisons, and the plot function now produces 3 plots.

model.iris = lpda(iris[,-5], iris[,5])
summary(model.iris)
#> Call:
#> lpda(data = iris[, -5], group = iris[, 5])
#> 
#> CONFUSION MATRIX
#>             Real
#> Predicted    setosa versicolor virginica
#>   setosa         50          0         0
#>   versicolor      0         49         1
#>   virginica       0          1        49

The application of lpda with 3 PCs gives the same classification errors. The plot function with PCscores = TRUE shows the scores on the first two PCs, but in this case without the hyperplanes.

model.iris2 = lpda(iris[,-5], iris[,5], pca=TRUE, PC=3)
summary(model.iris2)
#> Call:
#> lpda(data = iris[, -5], group = iris[, 5], pca = TRUE, PC = 3)
#> 
#> CONFUSION MATRIX
#>             Real
#> Predicted    setosa versicolor virginica
#>   setosa         50          0         0
#>   versicolor      0         49         1
#>   virginica       0          1        49
plot(model.iris2, PCscores= TRUE)

5. lpdaCV function

We can use the lpdaCV function to estimate the prediction error on different test sets by cross-validation for a specific model. Two strategies are implemented, loo (leave-one-out) and ktest, selected through the CV argument. When ktest is chosen, ntest is the size of the test set (samples not used to fit the model, only to count prediction errors) and R is the number of times the model is fitted and evaluated with different training and test sets.

lpdaCV(palmdates$spectra, group, pca = TRUE, CV = "loo")
lpdaCV(palmdates$spectra, group, pca = TRUE, CV = "ktest", ntest = 5, R = 10)

The CV results are so good because of the clear difference between the two groups in the spectra data. We can explore the lpdaCV possibilities with one of the matrices of the simulated RNAseq data, which is noisier than the spectra data. First, we check that the model with all the data achieves a separating hyperplane.

  data(RNAseq) # 3-dimensional array
  data = RNAseq[,,3] # the third data matrix with dimensions 20 x 600
  group = as.factor(rep(c("G1","G2"), each = 10))
  model = lpda(data, group) # model with all the variables
  summary(model)

By evaluating the error on several test sets with the CV functions, we can see that, to avoid overfitting, it is preferable to apply lpda with PCA:

lpdaCV(data, group, pca = FALSE, CV = "ktest", ntest = 5)
#> Call:
#> lpdaCV(data = data, group = group, pca = FALSE, CV = "ktest", 
#>     ntest = 5)
#> Prediction Error Rate for analysed model/models: 
#> Original Data 
#>           0.2
lpdaCV(data, group, pca = TRUE, CV = "ktest", ntest = 5)
#> Call:
#> lpdaCV(data = data, group = group, pca = TRUE, CV = "ktest", 
#>     ntest = 5)
#> Prediction Error Rate for analysed model/models: 
#> PC-2 
#>    0

As the success of the PCA solution depends on the chosen number of components, lpdaCV can also be applied for several percentages of the variability explained by the PCs of the original data, or for several numbers of PCs, in both cases specified as a vector. Examples:

lpdaCV(data, group, pca = TRUE, CV = "ktest", Variability = c(0.1, 0.9))
#> Call:
#> lpdaCV(data = data, group = group, pca = TRUE, Variability = c(0.1, 
#>     0.9), CV = "ktest")
#> Prediction Error Rate for analysed model/models: 
#> Variability-0.1 Variability-0.9 
#>            0.04            0.13
lpdaCV(data, group, pca = TRUE, CV = "ktest", PC= c(2, 10))
#> Call:
#> lpdaCV(data = data, group = group, pca = TRUE, PC = c(2, 10), 
#>     CV = "ktest")
#> Prediction Error Rate for analysed model/models: 
#>  PC-2 PC-10 
#>  0.02  0.06

6. lpda.3D function

This function has been designed to apply the lpda method to 3-dimensional data through 2 strategies:

  1. Applying lpda through the third dimension (pfac=FALSE).
  2. Applying lpda to the parafac scores (pfac=TRUE).

For the second strategy, the parafac function from the multiway package is used. More details are given in [4].

  data(RNAseq) # 3-dimensional array
  dim(RNAseq)
#> [1]  20 600   4
  group = as.factor(rep(c("G1","G2"), each = 10))

Strategy 1

  model3D = lpda.3D(RNAseq, group)
  summary(model3D)
#> CONFUSION MATRIX
#>          Real
#> Predicted G1 G2
#>        G1 10  0
#>        G2  0 10
  predict(model3D)
#> 
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G2" "G2" "G2" "G2" "G2" "G2" 
#>   17   18   19   20 
#> "G2" "G2" "G2" "G2"
  plot(model3D, mfrow=c(2,2))

Strategy 2: with parafac

  model3Ds2 = lpda.3D(RNAseq, group, pfac=TRUE, nfac=2)
  model3Ds2$MOD$mod.pfac$Rsq
#> [1] 0.809859
  predict(model3Ds2)
#> 
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G1" "G2" "G2" "G2" "G2" "G2" "G2" 
#>   17   18   19   20 
#> "G2" "G2" "G2" "G2"
  summary(model3Ds2)
#> CONFUSION MATRIX
#>          Real
#> Predicted G1 G2
#>        G1 10  0
#>        G2  0 10
  plot(model3Ds2, pfacscores=FALSE, main="Parafac Model")

  plot(model3Ds2, pfacscores=TRUE, cex=1.5, main="Parafac components")

7. lpdaCV.3D function

In the same way as lpdaCV, this function is useful to evaluate the model and to decide the number of factors in the parafac model.

  lpdaCV.3D(RNAseq, group , CV = "ktest", R=5, ntest=5, pfac=TRUE, nfac=c(2,10))
#> Call:
#> lpdaCV.3D(data = RNAseq, group = group, pfac = TRUE, nfac = c(2, 
#>     10), CV = "ktest", ntest = 5, R = 5)
#> Prediction Error Rate for analysed model/models: 
#>  nfac-2 nfac-10 
#>    0.00    0.12

8. References

[1] Nueda, M.J.; Gandía, C. and Molina, M.D. (2022) LPDA: A new classification method based on linear programming. PLOS ONE 17(7): e0270403. https://doi.org/10.1371/journal.pone.0270403

[2] Nueda, M.J.; Gandía, C. and Molina, M.D. Classifying sequencing data using linear programming. Euro30 Conference, Dublin, June-2019.

[3] Abdrabo, S.S.; Gras, L. and Mora, J. (2013) Analytical methods applied to the chemical characterization and classification of palm dates (Phoenix dactylifera L.) from Elche’s Palm Grove. Ph.D. thesis, Departamento de Química Analítica, Nutrición y Bromatología. Universidad de Alicante.

[4] Gandía, C.; Molina, M.D. and Nueda, M.J. (2025) Adapting the lpda R package to Multiway data. In preparation.