library(REMLA)
#> Loading required package: GPArotation
#> Loading required package: geex
#>
#> Attaching package: 'geex'
#> The following object is masked from 'package:methods':
#>
#> show
#> Package 'REMLA' version 1.2.0
#> For documentation, questions, and issues please see github link https://github.com/knieser/REM.
#> Type 'citation("REMLA")' for citing this R package in publications.
This vignette introduces the use of the robust expectation maximization (REM) algorithm; a statistical method for estimating the parameters of a latent variable model in the presence of population heterogeneity as suggested by Nieser & Cochran (2023). The REM algorithm is based on the expectation-maximization (EM) algorithm, but it allows for the case when not all the data are generated by the assumed data generating model.
To illustrate the practical application of the REM (Robust Expectation Maximization) package for both exploratory and confirmatory Factor Analysis, we have chosen to employ it in the context of the Holzinger and Swineford dataset publicly available in the lavaan package.
The Holzinger and Swineford dataset consist of mental ability test scores of seventh and eighth grade children from two different schools. Originally, this dataset consisted of 26 test scores but this modified version in the lavaan package consist of 15 variables, 9 of which are test scores.
id
: student’s Identification number.sex
: Gender (1 = male, 2 = female).ageyr
: Age year part.agemo
: Age month part.school
: School (Pasteur or Grant-White).grade
: Grade (7 = 7th grade, 8 = 8th grade).x1
: Visual perception.x2
: Cubes.x3
: Lozenges.x4
: Paragraph comprehension.x5
: Sentence completion.x6
: Word meaning.x7
: Speeded addition.x8
: Speeded counting of dots.x9
: Speeded discrimination straight and curved
capitals.library(lavaan)
#> This is lavaan 0.6-18
#> lavaan is FREE software! Please report any bugs.
df <- HolzingerSwineford1939
head(df)
#> id sex ageyr agemo school grade x1 x2 x3 x4 x5 x6
#> 1 1 1 13 1 Pasteur 7 3.333333 7.75 0.375 2.333333 5.75 1.2857143
#> 2 2 2 13 7 Pasteur 7 5.333333 5.25 2.125 1.666667 3.00 1.2857143
#> 3 3 2 13 1 Pasteur 7 4.500000 5.25 1.875 1.000000 1.75 0.4285714
#> 4 4 1 13 2 Pasteur 7 5.333333 7.75 3.000 2.666667 4.50 2.4285714
#> 5 5 2 12 2 Pasteur 7 4.833333 4.75 0.875 2.666667 4.00 2.5714286
#> 6 6 2 14 1 Pasteur 7 5.333333 5.00 2.250 1.000000 3.00 0.8571429
#> x7 x8 x9
#> 1 3.391304 5.75 6.361111
#> 2 3.782609 6.25 7.916667
#> 3 3.260870 3.90 4.416667
#> 4 3.000000 5.30 4.861111
#> 5 3.695652 6.30 5.916667
#> 6 4.347826 6.65 7.500000
We limit the data to the columns that will be used in the analysis. Data can be either in a matrix or data frame and must be numeric.
The REM_EFA function, specifically designed for conducting exploratory factor analysis, requires specification of three parameters: X, k_range and delta. The X parameter corresponds to the dataframe or matrix to be analyzed. The k_range parameter comprises a vector of numbers of factors to be modeled. In the example provided, the algorithm conducts exploratory analysis with 1 to 3 factors. Moreover, the delta parameter serves as a hyperparameter, ranging from 0 to 1, and it reflects the researcher’s tolerance level for potentially down-weighting data inaccurately within the model. The default for the delta parameter is 0.05.
The output contains a separate list of estimates for each number of factors under consideration.
model_EFA = REM_EFA(X = data, k_range = 1:3)
#> EFA with 1 factors
#> getting EM estimates...
#> done
#> searching for optimal epsilon
#> done
#> getting REM estimates...
#> done
#> calculating standard errors...
#> done
#> EFA with 2 factors
#> getting EM estimates...
#> done
#> searching for optimal epsilon
#> done
#> getting REM estimates...
#> done
#> calculating standard errors...
#> done
#> EFA with 3 factors
#> getting EM estimates...
#> done
#> searching for optimal epsilon
#> done
#> getting REM estimates...
#> done
#> calculating standard errors...
#> done
The summary function displays the estimates of the optimal model which is selected using the lowest score of Bayesian Information Criterion (BIC) among the number of factors considered. In other words, it selects the optimal model after considering the BIC in the models with 1, 2, and 3 factors.
summary(model_EFA)
#> -----Exploratory factor analysis using REM-----
#> Call: REM_EFA(X = data, k_range = 1:3)
#>
#> According to BIC evaluated at REM estimates,
#> the optimal number of factors is: 3
#>
#> REM estimates w/ delta = 0.05 ( epsilon = 9.216477e-07 )
#> --------------------------
#> Est. gamma = 0.843
#>
#> Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1 4.911 0.112 0.709 -0.043 0.373
#> x2 5.657 0.053 0.393 -0.141 0.821
#> x3 2.051 -0.146 0.562 0.044 0.686
#> x4 2.879 0.799 0.098 0.002 0.295
#> x5 3.529 0.925 -0.059 -0.006 0.210
#> x6 2.261 0.834 0.013 0.013 0.305
#> x7 3.894 0.044 -0.132 0.780 0.460
#> x8 5.847 -0.044 0.045 0.766 0.449
#> x9 5.737 -0.021 0.375 0.512 0.507
#>
#> Factor correlations
#> Factor 1 Factor 2 Factor 3
#> Factor 1 1.000 0.285 0.339
#> Factor 2 0.285 1.000 0.135
#> Factor 3 0.339 0.135 1.000
#>
#> EM estimates
#> --------------------------
#> Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1 4.237 0.178 0.589 0.028 0.513
#> x2 5.180 0.037 0.495 -0.126 0.749
#> x3 1.993 -0.084 0.674 0.021 0.543
#> x4 2.641 0.843 0.020 0.003 0.282
#> x5 3.379 0.894 -0.067 0.004 0.244
#> x6 2.003 0.810 0.075 -0.016 0.307
#> x7 3.849 0.026 -0.150 0.762 0.497
#> x8 5.468 -0.054 0.103 0.731 0.473
#> x9 5.335 0.015 0.358 0.482 0.544
#>
#> Factor correlations
#> Factor 1 Factor 2 Factor 3
#> Factor 1 1.000 0.149 0.357
#> Factor 2 0.149 1.000 0.030
#> Factor 3 0.357 0.030 1.000
#>
#> Difference (EM - REM)
#> --------------------------
#> Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1 -0.674 0.066 -0.119 0.071 0.140
#> x2 -0.477 -0.016 0.102 0.015 -0.072
#> x3 -0.057 0.061 0.111 -0.023 -0.142
#> x4 -0.238 0.044 -0.078 0.001 -0.013
#> x5 -0.150 -0.030 -0.007 0.010 0.034
#> x6 -0.257 -0.024 0.063 -0.028 0.002
#> x7 -0.045 -0.018 -0.018 -0.018 0.037
#> x8 -0.379 -0.010 0.057 -0.035 0.024
#> x9 -0.402 0.035 -0.017 -0.030 0.037
#>
#> Factor correlations
#> Factor 1 Factor 2 Factor 3
#> Factor 1 0.000 -0.136 0.017
#> Factor 2 -0.136 0.000 -0.106
#> Factor 3 0.018 -0.106 0.000
From the REM and EM output above, note that \(x_4\), \(x_5\), and \(x_6\) strongly load on Factor 1. Similarly, \(x_1\), \(x_2\), and \(x_3\) moderately load on Factor 2 and that \(x_7\), \(x_8\) and \(x_9\) moderately load on Factor 3. Additionally, the gamma or average REM weight was 0.849, indicating that about 15 percent of the sample differed in the factor model parameters from the majority of the sample.
To conduct a confirmatory factor analysis, we need to specify a model which consists of structural equations relating the measured variables with the latent variables. We consider the following three factor model:
\[\text{Visual} = x_1 + x_2 + x_3 \] \[\text{Textual} = x_4 + x_5 + x_6 \] \[\text{Speed} = x_7 + x_8 + x_9\] To create the model parameter, first create a string variable that contains each equation in a new line and separate equalities using the \(=\sim\) symbol as shown below.
# Define your model as a string
model <- " Visual =~ x1 + x2 + x3
Textual =~ x4 + x5 + x6
Speed =~ x7 + x8 + x9
"
This model is used to investigate the relationships between the latent variables “Visual”,“Textual”, and “Speed” and their respective sets of observed variables by examining how well the observed variables predict the underlying constructs. The model will estimate coefficients for these relationships, providing information about the strength and direction of these associations.
Finally, use the REM_CFA() function to perform the confirmatory factor analysis. For this include X (the numeric data frame under consideration), delta (a hyperparameter), and the model parameter.
# CFA model with delta = 0.05
model_CFA = REM_CFA(X = data, model = model)
#> CFA with 3 factors
#> getting EM estimates...
#> done
#> searching for optimal epsilon
#> done
#> getting REM estimates...
#> done
#> calculating standard errors...
#> done
summary(model_CFA)
#> -----Confirmatory factor analysis using REM-----
#> Call: REM_CFA(X = data, model = model)
#>
#> Model: Visual =~ x1 + x2 + x3
#> Textual =~ x4 + x5 + x6
#> Speed =~ x7 + x8 + x9
#>
#>
#> REM estimates w/ delta = 0.05 ( epsilon = 8.005012e-07 )
#> --------------------------
#> Est. gamma = 0.83
#>
#> Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1 5.034 0.821 0.000 0.000 0.326
#> x2 5.754 0.359 0.000 0.000 0.871
#> x3 2.058 0.500 0.000 0.000 0.750
#> x4 2.896 0.000 -0.843 0.000 0.289
#> x5 3.557 0.000 -0.871 0.000 0.241
#> x6 2.269 0.000 -0.840 0.000 0.295
#> x7 3.895 0.000 0.000 0.634 0.598
#> x8 5.858 0.000 0.000 0.752 0.435
#> x9 5.757 0.000 0.000 0.657 0.569
#>
#> Factor correlations
#> Factor 1 Factor 2 Factor 3
#> Factor 1 1.000 -0.496 0.359
#> Factor 2 -0.496 1.000 -0.261
#> Factor 3 0.359 -0.261 1.000
#>
#> EM estimates
#> --------------------------
#> Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1 4.236 0.764 0.000 0.000 0.416
#> x2 5.180 0.427 0.000 0.000 0.818
#> x3 1.993 0.585 0.000 0.000 0.657
#> x4 2.637 0.000 -0.851 0.000 0.276
#> x5 3.373 0.000 -0.855 0.000 0.270
#> x6 2.001 0.000 -0.838 0.000 0.298
#> x7 3.848 0.000 0.000 0.567 0.679
#> x8 5.467 0.000 0.000 0.719 0.483
#> x9 5.334 0.000 0.000 0.670 0.552
#>
#> Factor correlations
#> Factor 1 Factor 2 Factor 3
#> Factor 1 1.000 -0.459 0.478
#> Factor 2 -0.459 1.000 -0.285
#> Factor 3 0.478 -0.285 1.000
#>
#> Difference (EM - REM)
#> --------------------------
#> Intercept Factor 1 Factor 2 Factor 3 Res Var
#> x1 -0.798 -0.057 0.000 0.000 0.090
#> x2 -0.574 0.068 0.000 0.000 -0.053
#> x3 -0.064 0.085 0.000 0.000 -0.092
#> x4 -0.259 0.000 -0.008 0.000 -0.013
#> x5 -0.184 0.000 0.017 0.000 0.029
#> x6 -0.269 0.000 0.002 0.000 0.003
#> x7 -0.046 0.000 0.000 -0.067 0.081
#> x8 -0.391 0.000 0.000 -0.033 0.049
#> x9 -0.423 0.000 0.000 0.013 -0.017
#>
#> Factor correlations
#> Factor 1 Factor 2 Factor 3
#> Factor 1 0.000 0.037 0.119
#> Factor 2 0.038 0.000 -0.024
#> Factor 3 0.119 -0.024 0.000
The output above depicts estimates for the confirmatory analysis. Note that again \(x_4\), \(x_5\), and \(x_6\) are strongly correlated with Factor 1. Similarly, \(x_1\),\(x_2\), and \(x_3\) are moderately correlated with factor 2 and that \(x_7\), \(x_8\) and \(x_9\) are moderately correlated with Factor 3. However, the gamma or average REM weight was 0.833. Note that the difference between the EM and REM intercepts are negative indicating that this down-weighted group has lower average item scores.