Structural Equation Modeling | Lab Session 4
1 What we are going to cover
- Non-normal continuous data
- Categorical data
- Missing data
2 Data
The data set used throughout is the European Social Survey ESS4-2008, Edition 4.5 (released on 1 December 2018). We restrict the analysis to the Belgian case: each line in the data set represents a Belgian respondent. The full dataset and documentation can be found on the ESS website
Codebook:
gvslvol Standard of living for the old, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)
gvslvue Standard of living for the unemployed, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)
gvhlthc Health care for the sick, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)
gvcldcr Child care services for working parents, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)
gvjbevn Job for everyone, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)
gvpdlwk Paid leave from work to care for sick family, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)
sbstrec Social benefits/services place too great strain on economy (1 Agree strongly - 5 Disagree strongly)
sbbsntx Social benefits/services cost businesses too much in taxes/charges (1 Agree strongly - 5 Disagree strongly)
sbprvpv Social benefits/services prevent widespread poverty (1 Agree strongly - 5 Disagree strongly)
sbeqsoc Social benefits/services lead to a more equal society (1 Agree strongly - 5 Disagree strongly)
sbcwkfm Social benefits/services make it easier to combine work and family (1 Agree strongly - 5 Disagree strongly)
sblazy Social benefits/services make people lazy (1 Agree strongly - 5 Disagree strongly)
sblwcoa Social benefits/services make people less willing care for one another (1 Agree strongly - 5 Disagree strongly)
sblwlka Social benefits/services make people less willing look after themselves/family (1 Agree strongly - 5 Disagree strongly)
agea Respondent’s age
eduyrs Years of full-time education completed
gndr Gender (1 Male, 2 Female)
hinctnta Household’s total net income, all sources (Deciles of the actual household income range in the given country.)
gincdif Government should reduce differences in income levels (1 Agree strongly - 5 Disagree strongly)
dfincac Large differences in income acceptable to reward talents and efforts (1 Agree strongly - 5 Disagree strongly)
smdfslv For fair society, differences in standard of living should be small (1 Agree strongly - 5 Disagree strongly)
3 Environment preparation
First, let's load the packages needed to import, manipulate, visualize, and analyse the data.
# Uncomment this once if you need to install the packages on your system
### DATA MANIPULATION ###
# install.packages("haven") # data import from spss
# install.packages("dplyr") # data manipulation
# install.packages("psych") # descriptives
# install.packages("stringr") # string manipulation
# install.packages("purrr") # table manipulation
### MODELING ###
# install.packages("lavaan") # SEM modelling
#### For MVN we also need install.packages("gsl").
#### On macOS with a Homebrew-installed GSL:
#### CFLAGS="-I/opt/homebrew/Cellar/gsl/2.7.1/include" LDFLAGS="-L/opt/homebrew/Cellar/gsl/2.7.1/lib -lgsl -lgslcblas" R
#### install.packages("gsl")
# install.packages("MVN") # tests for multivariate normality
# install.packages("Amelia") # performing multiple imputation
# install.packages("semTools") # fit a model to multiple imputed data set
### VISUALIZATION ###
# install.packages("tidySEM") # plotting SEM models
# install.packages("ggplot2") # plotting
# install.packages("patchwork") # organization plots
# Load the packages
### DATA MANIPULATION ###
library("haven")
library("dplyr")
library("psych")
library("stringr")
library("purrr")
### MODELING ###
library("lavaan")
library("MVN")
library("Amelia")
#library("semTools") # loaded later for compatibility issues
### VISUALIZATION ###
library("tidySEM")
library("ggplot2")
library("patchwork")
4 Non-normal, continuous data
Special data analytic techniques should be used if any of the continuous variables in the model are non-normal. Maximum likelihood (ML) estimation with non-normal continuous variables results in biased standard errors and deteriorated model fit. Practically, this means that the chi-square statistic will be erroneously large and the standard errors too small, increasing the probability of Type I errors. The parameter estimates (e.g., loadings, regression coefficients) are largely unaffected (Finch, West, and MacKinnon 1997).
4.1 Detecting non-normality
The unadjusted ML approach most commonly employed for SEM assumes multivariate normality. A first step in assessing multivariate normality consists of checking whether the univariate distributions are normal: if the univariate distributions are non-normal, then the multivariate distribution will be non-normal as well. We can plot the univariate distributions of our variables and perform a Kolmogorov-Smirnov (KS) test.
ess_df <- haven::read_sav("https://github.com/albertostefanelli/SEM_labs/raw/master/data/ESS4_belgium.sav")
# We are going to measure welfare support using 4 items: gvslvol gvhlthc gvcldcr gvpdlwk
# First let's visually inspect our variables
h_gvslvol <- ggplot(ess_df, aes(gvslvol)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", alpha = 0.3)

h_gvhlthc <- ggplot(ess_df, aes(gvhlthc)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", alpha = 0.3)

h_gvcldcr <- ggplot(ess_df, aes(gvcldcr)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", alpha = 0.3)

h_gvpdlwk <- ggplot(ess_df, aes(gvpdlwk)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", alpha = 0.3)

h_gvslvol + h_gvhlthc + h_gvcldcr + h_gvpdlwk
We can check whether a distribution is (approximately) normal using the Kolmogorov-Smirnov (KS) test. The KS test returns a D statistic and a corresponding p-value. The D statistic is the maximum absolute distance between the cumulative distribution function of our variable and that of a normally distributed variable with the same mean and standard deviation. The closer D is to 0, the more the distribution of our variable resembles a normal distribution.
ks_gvslvol <- ks.test(ess_df$gvslvol, "pnorm", mean=mean(ess_df$gvslvol, na.rm=T), sd=sd(ess_df$gvslvol, na.rm=T))
ks_gvhlthc <- ks.test(ess_df$gvhlthc, "pnorm", mean=mean(ess_df$gvhlthc, na.rm=T), sd=sd(ess_df$gvhlthc, na.rm=T))
ks_gvcldcr <- ks.test(ess_df$gvcldcr, "pnorm", mean=mean(ess_df$gvcldcr, na.rm=T), sd=sd(ess_df$gvcldcr, na.rm=T))
ks_gvpdlwk <- ks.test(ess_df$gvpdlwk, "pnorm", mean=mean(ess_df$gvpdlwk, na.rm=T), sd=sd(ess_df$gvpdlwk, na.rm=T))
data.frame(Variables = c("gvslvol", "gvhlthc", "gvcldcr", "gvpdlwk"),
           D = round(c(ks_gvslvol$statistic, ks_gvhlthc$statistic, ks_gvcldcr$statistic, ks_gvpdlwk$statistic), 2),
           "P-value" = c(ks_gvslvol$p.value, ks_gvhlthc$p.value, ks_gvcldcr$p.value, ks_gvpdlwk$p.value)
)
Variables D P.value
1 gvslvol 0.19 0
2 gvhlthc 0.20 0
3 gvcldcr 0.17 0
4 gvpdlwk 0.17 0
We can also have multivariate non-normality even when all the individual variables are normally distributed (although severe multivariate non-normality is probably less likely in that case). There are different multivariate normality tests (e.g., Mardia, Royston, Doornik-Hansen); we are going to use the Henze-Zirkler multivariate normality test. For multivariate normality, the test p-value should be greater than 0.05.
# select the ws items
ess_df_ws <- ess_df[, c("gvslvol", "gvhlthc", "gvcldcr", "gvpdlwk")]
# remove NAs
ess_df_ws_na <- na.omit(ess_df_ws)

mvn_test <- mvn(data = ess_df_ws_na, # our data without NAs
                mvnTest = c("hz")    # type of normality test to perform
)

mvn_test$multivariateNormality
           Test       HZ p value MVN
1 Henze-Zirkler 27.21174       0  NO
To mitigate non-normality, we can apply the scaled \({\chi^2}\) and "robust" standard error corrections to ML estimation proposed by Satorra and Bentler (1988; 1994). Adjustments are made to the \({\chi^2}\) (and to \({\chi^2}\)-based fit indices) and to the standard errors based on a weight matrix derived from an estimate of multivariate kurtosis (as said before, the parameter estimates themselves are not altered).
model_ws <- '
ws =~ gvslvol + # Standard of living for the old
      gvhlthc + # Health care for the sick
      gvcldcr + # Child care services for working parents
      gvpdlwk   # Paid leave from work to care for sick family
'

fit_ws_ml <- cfa(model_ws,        # model formula
                 data = ess_df,   # data frame
                 estimator = "ML" # select the estimator
)

fit_ws_mlr <- cfa(model_ws,         # model formula
                  data = ess_df,    # data frame
                  estimator = "MLM" # select the estimator
)
summary(fit_ws_mlr,
fit.measures=TRUE)
lavaan 0.6-10 ended normally after 23 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 8
Used Total
Number of observations 1745 1760
Model Test User Model:
Standard Robust
Test Statistic 266.283 106.210
Degrees of freedom 2 2
P-value (Chi-square) 0.000 0.000
Scaling correction factor 2.507
Satorra-Bentler correction
Model Test Baseline Model:
Test statistic 1921.621 1147.034
Degrees of freedom 6 6
P-value 0.000 0.000
Scaling correction factor 1.675
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.862 0.909
Tucker-Lewis Index (TLI) 0.586 0.726
Robust Comparative Fit Index (CFI) 0.863
Robust Tucker-Lewis Index (TLI) 0.590
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -12411.000 -12411.000
Loglikelihood unrestricted model (H1) NA NA
Akaike (AIC) 24837.999 24837.999
Bayesian (BIC) 24881.715 24881.715
Sample-size adjusted Bayesian (BIC) 24856.300 24856.300
Root Mean Square Error of Approximation:
RMSEA 0.275 0.173
90 Percent confidence interval - lower 0.248 0.155
90 Percent confidence interval - upper 0.304 0.191
P-value RMSEA <= 0.05 0.000 0.000
Robust RMSEA 0.274
90 Percent confidence interval - lower 0.231
90 Percent confidence interval - upper 0.319
Standardized Root Mean Square Residual:
SRMR 0.079 0.079
Parameter Estimates:
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|)
ws =~
gvslvol 1.000
gvhlthc 0.921 0.044 20.858 0.000
gvcldcr 0.852 0.052 16.287 0.000
gvpdlwk 0.876 0.051 17.080 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.gvslvol 0.846 0.080 10.619 0.000
.gvhlthc 1.024 0.084 12.134 0.000
.gvcldcr 2.126 0.134 15.832 0.000
.gvpdlwk 2.055 0.138 14.915 0.000
ws 1.350 0.103 13.124 0.000
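Since we estimated both models, we can put the naive ML and the Satorra-Bentler scaled fit statistics side by side. A minimal sketch, assuming the two fit objects from above are still in the workspace and using the .scaled suffixes that lavaan's fitMeasures() attaches to the robust versions of its fit measures:
# compare naive ML with the Satorra-Bentler scaled fit statistics
fm        <- c("chisq", "df", "pvalue", "cfi", "tli", "rmsea")
fm_scaled <- paste0(fm, ".scaled")
round(cbind(ML         = fitMeasures(fit_ws_ml, fm),
            MLM_scaled = fitMeasures(fit_ws_mlr, fm_scaled)), 3)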
5 Categorical data
Different types of categorical data:
- binary: male/female, dead/alive
- nominal with \(K>2\): race (black, latino, asian)
- ordinal: ses (high, middle, low), likert scales (agree strongly, agree, neutral, disagree, disagree strongly)
- counts: number of deadly accidents in a year
Lavaan can deal only with binary and ordinal endogenous variables. There are two approaches to categorical variables in SEM: a three-stage WLS approach and a full information approach, whose estimator is referred to as marginal maximum likelihood. Only the former is well implemented in lavaan, including "robust" variants. Let's use the social (sbprvpv, sbeqsoc, sbcwkfm) and moral (sblazy, sblwcoa, sblwlka) criticism items to estimate a model with two first-order ordered latent variables.
Stages of WLS (after fitting, the quantities from the first two stages can be inspected; see the sketch after the model output below):
- An observed ordinal variable \(y\) is treated as a categorized version of a latent continuous variable \(y^*\); in our social criticism example, an ordinal variable with K = 5 response categories. The model estimates the thresholds using univariate information only. This is part of the mean structure of the model.
- Keeping the thresholds fixed, the model estimates the correlations among the \(y^*\) variables. Only the correlation matrix is used, rather than the covariance matrix. In our example with ordinal data, the model estimates polychoric correlations.
- The full SEM model is estimated using weighted least squares. If exogenous covariates are involved (i.e., regression paths), the correlations are adjusted based on the values of \(y^*\). The residual variances of \(y^*\) are estimated (the so-called "scale parameters").
model_wc <- '
## Social criticism ##
wc_socia =~ sbprvpv + sbeqsoc + sbcwkfm
## Moral criticism ##
wc_moral =~ sblazy + sblwcoa + sblwlka
'

fit_wc_wlsmv <- cfa(model_wc,
                    ordered = c("sbprvpv",
                                "sbeqsoc",
                                "sbcwkfm",
                                "sblazy",
                                "sblwcoa",
                                "sblwlka"),
                    data = ess_df,
                    estimator = "WLSMV"
)

summary(fit_wc_wlsmv, standardized=TRUE)
lavaan 0.6-10 ended normally after 23 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 31
Used Total
Number of observations 1711 1760
Model Test User Model:
Standard Robust
Test Statistic 21.993 29.388
Degrees of freedom 8 8
P-value (Chi-square) 0.005 0.000
Scaling correction factor 0.783
Shift parameter 1.287
simple second-order correction
Parameter Estimates:
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
wc_socia =~
sbprvpv 1.000 0.610 0.610
sbeqsoc 1.443 0.120 11.979 0.000 0.880 0.880
sbcwkfm 0.691 0.044 15.740 0.000 0.422 0.422
wc_moral =~
sblazy 1.000 0.736 0.736
sblwcoa 1.132 0.027 42.700 0.000 0.833 0.833
sblwlka 1.015 0.024 42.012 0.000 0.747 0.747
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
wc_socia ~~
wc_moral -0.041 0.014 -2.890 0.004 -0.090 -0.090
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.sbprvpv 0.000 0.000 0.000
.sbeqsoc 0.000 0.000 0.000
.sbcwkfm 0.000 0.000 0.000
.sblazy 0.000 0.000 0.000
.sblwcoa 0.000 0.000 0.000
.sblwlka 0.000 0.000 0.000
wc_socia 0.000 0.000 0.000
wc_moral 0.000 0.000 0.000
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
sbprvpv|t1 -1.221 0.040 -30.423 0.000 -1.221 -1.221
sbprvpv|t2 0.528 0.032 16.562 0.000 0.528 0.528
sbprvpv|t3 1.111 0.038 29.093 0.000 1.111 1.111
sbprvpv|t4 2.213 0.081 27.384 0.000 2.213 2.213
sbeqsoc|t1 -1.420 0.045 -31.912 0.000 -1.420 -1.420
sbeqsoc|t2 0.487 0.032 15.375 0.000 0.487 0.487
sbeqsoc|t3 1.150 0.039 29.605 0.000 1.150 1.150
sbeqsoc|t4 2.165 0.077 28.025 0.000 2.165 2.165
sbcwkfm|t1 -1.462 0.046 -32.072 0.000 -1.462 -1.462
sbcwkfm|t2 0.538 0.032 16.846 0.000 0.538 0.538
sbcwkfm|t3 1.279 0.041 30.976 0.000 1.279 1.279
sbcwkfm|t4 2.375 0.095 25.029 0.000 2.375 2.375
sblazy|t1 -1.458 0.045 -32.058 0.000 -1.458 -1.458
sblazy|t2 -0.204 0.031 -6.690 0.000 -0.204 -0.204
sblazy|t3 0.428 0.031 13.659 0.000 0.428 0.428
sblazy|t4 1.631 0.051 32.211 0.000 1.631 1.631
sblwcoa|t1 -1.573 0.049 -32.255 0.000 -1.573 -1.573
sblwcoa|t2 -0.198 0.031 -6.497 0.000 -0.198 -0.198
sblwcoa|t3 0.355 0.031 11.455 0.000 0.355 0.355
sblwcoa|t4 1.747 0.055 31.853 0.000 1.747 1.747
sblwlka|t1 -1.548 0.048 -32.245 0.000 -1.548 -1.548
sblwlka|t2 -0.275 0.031 -8.955 0.000 -0.275 -0.275
sblwlka|t3 0.272 0.031 8.859 0.000 0.272 0.272
sblwlka|t4 1.804 0.057 31.554 0.000 1.804 1.804
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.sbprvpv 0.628 0.628 0.628
.sbeqsoc 0.225 0.225 0.225
.sbcwkfm 0.822 0.822 0.822
.sblazy 0.459 0.459 0.459
.sblwcoa 0.306 0.306 0.306
.sblwlka 0.442 0.442 0.442
wc_socia 0.372 0.036 10.462 0.000 1.000 1.000
wc_moral 0.541 0.020 27.352 0.000 1.000 1.000
Scales y*:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
sbprvpv 1.000 1.000 1.000
sbeqsoc 1.000 1.000 1.000
sbcwkfm 1.000 1.000 1.000
sblazy 1.000 1.000 1.000
sblwcoa 1.000 1.000 1.000
sblwlka 1.000 1.000 1.000
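The quantities estimated in the first two WLS stages can be pulled directly out of the fitted object. A minimal sketch, assuming lavInspect()'s "th" and "sampstat" keywords (as documented in lavaan) return the thresholds and the sample statistics:
# Stage 1: estimated thresholds of the latent y* variables
lavInspect(fit_wc_wlsmv, "th")
# Stage 2: polychoric correlation matrix the model is fitted to
lavInspect(fit_wc_wlsmv, "sampstat")$cov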
6 Missing data
Different types of missingness:
- Missing completely at random (MCAR): the events that lead to any particular data item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. When data are MCAR, analyses performed on the data are unbiased; however, data are rarely MCAR.
- Missing at random (MAR): the missingness is not random, but can be fully accounted for by variables on which there is complete information. MAR is an assumption that is impossible to verify statistically; we must rely on its substantive reasonableness.
- Missing not at random (MNAR), also known as nonignorable nonresponse: data that are neither MAR nor MCAR (i.e., the value of the missing variable is related to the reason it is missing).
Three different approaches to missing data (see Brown, Chapter 9):
- Default behaviour: listwise deletion
- Best option: Case-wise (or ‘full information’) maximum likelihood (FIML). Can be employed ONLY when the missing mechanism is MCAR or MAR. A scaled (‘Yuan-Bentler’) test statistic similar to the one used for non-normal continuous data can be computed.
- Second best option: Multiple imputation
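Before choosing among these approaches, it helps to quantify how much is missing and where. A minimal sketch in base R, using the welfare criticism items that enter the model below (the wc_items name is just a convenience introduced here):
# items used in the welfare criticism model
wc_items <- c("sbstrec", "sbbsntx", "sbprvpv", "sbeqsoc",
              "sbcwkfm", "sblazy", "sblwcoa", "sblwlka")
colSums(is.na(ess_df[, wc_items]))        # number of NAs per item
mean(!complete.cases(ess_df[, wc_items])) # share of respondents with any NA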
model_wc <- '
## Economic criticism ##
wc_econo =~ sbstrec + sbbsntx
## Social criticism ##
wc_socia =~ sbprvpv + sbeqsoc + sbcwkfm
## Moral criticism ##
wc_moral =~ sblazy + sblwcoa + sblwlka
'

# listwise
fit_wc_6_listwise <- cfa(model_wc,            # model formula
                         data = ess_df,       # data frame
                         missing = "listwise" # listwise deletion
)
summary(fit_wc_6_listwise, standardized=TRUE)
lavaan 0.6-10 ended normally after 36 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 19
Used Total
Number of observations 1679 1760
Model Test User Model:
Test statistic 71.348
Degrees of freedom 17
P-value (Chi-square) 0.000
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
wc_econo =~
sbstrec 1.000 0.628 0.616
sbbsntx 1.001 0.093 10.817 0.000 0.629 0.629
wc_socia =~
sbprvpv 1.000 0.462 0.527
sbeqsoc 1.493 0.166 9.011 0.000 0.689 0.824
sbcwkfm 0.631 0.055 11.461 0.000 0.291 0.375
wc_moral =~
sblazy 1.000 0.749 0.710
sblwcoa 1.058 0.045 23.285 0.000 0.792 0.766
sblwlka 0.977 0.043 22.843 0.000 0.732 0.706
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
wc_econo ~~
wc_socia -0.024 0.011 -2.093 0.036 -0.082 -0.082
wc_moral 0.250 0.023 10.872 0.000 0.532 0.532
wc_socia ~~
wc_moral -0.034 0.012 -2.915 0.004 -0.099 -0.099
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.sbstrec 0.647 0.042 15.381 0.000 0.647 0.621
.sbbsntx 0.603 0.041 14.584 0.000 0.603 0.604
.sbprvpv 0.554 0.030 18.722 0.000 0.554 0.722
.sbeqsoc 0.224 0.051 4.404 0.000 0.224 0.320
.sbcwkfm 0.518 0.020 25.832 0.000 0.518 0.859
.sblazy 0.551 0.027 20.083 0.000 0.551 0.495
.sblwcoa 0.441 0.027 16.639 0.000 0.441 0.413
.sblwlka 0.539 0.027 20.320 0.000 0.539 0.502
wc_econo 0.395 0.045 8.689 0.000 1.000 1.000
wc_socia 0.213 0.029 7.328 0.000 1.000 1.000
wc_moral 0.561 0.039 14.477 0.000 1.000 1.000
# Full information maximum likelihood (FIML)
fit_wc_6_fiml <- cfa(model_wc,          # model formula
                     data = ess_df,     # data frame
                     missing = "direct" # alias: "ml" or "fiml"
)
summary(fit_wc_6_fiml, standardized=TRUE)
lavaan 0.6-10 ended normally after 45 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 27
Used Total
Number of observations 1758 1760
Number of missing patterns 34
Model Test User Model:
Test statistic 75.087
Degrees of freedom 17
P-value (Chi-square) 0.000
Parameter Estimates:
Standard errors Standard
Information Observed
Observed information based on Hessian
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
wc_econo =~
sbstrec 1.000 0.622 0.610
sbbsntx 1.011 0.092 10.931 0.000 0.628 0.631
wc_socia =~
sbprvpv 1.000 0.472 0.541
sbeqsoc 1.420 0.172 8.270 0.000 0.670 0.804
sbcwkfm 0.627 0.053 11.752 0.000 0.296 0.381
wc_moral =~
sblazy 1.000 0.747 0.708
sblwcoa 1.065 0.045 23.481 0.000 0.795 0.769
sblwlka 0.977 0.043 22.799 0.000 0.729 0.705
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
wc_econo ~~
wc_socia -0.019 0.012 -1.592 0.111 -0.065 -0.065
wc_moral 0.250 0.023 10.878 0.000 0.539 0.539
wc_socia ~~
wc_moral -0.034 0.012 -2.900 0.004 -0.095 -0.095
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.sbstrec 2.962 0.024 121.159 0.000 2.962 2.907
.sbbsntx 2.535 0.024 105.527 0.000 2.535 2.544
.sbprvpv 2.334 0.021 111.740 0.000 2.334 2.677
.sbeqsoc 2.375 0.020 119.100 0.000 2.375 2.848
.sbcwkfm 2.335 0.019 125.202 0.000 2.335 3.004
.sblazy 2.894 0.025 114.871 0.000 2.894 2.743
.sblwcoa 2.920 0.025 118.148 0.000 2.920 2.821
.sblwlka 2.979 0.025 120.502 0.000 2.979 2.881
wc_econo 0.000 0.000 0.000
wc_socia 0.000 0.000 0.000
wc_moral 0.000 0.000 0.000
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.sbstrec 0.652 0.041 15.866 0.000 0.652 0.628
.sbbsntx 0.598 0.041 14.681 0.000 0.598 0.602
.sbprvpv 0.537 0.032 16.919 0.000 0.537 0.707
.sbeqsoc 0.246 0.053 4.611 0.000 0.246 0.354
.sbcwkfm 0.516 0.020 25.386 0.000 0.516 0.855
.sblazy 0.556 0.027 20.343 0.000 0.556 0.499
.sblwcoa 0.438 0.026 16.852 0.000 0.438 0.409
.sblwlka 0.538 0.026 20.842 0.000 0.538 0.503
wc_econo 0.387 0.044 8.737 0.000 1.000 1.000
wc_socia 0.223 0.032 7.019 0.000 1.000 1.000
wc_moral 0.558 0.038 14.626 0.000 1.000 1.000
# combine robust (sandwich) standard errors with a scaled (Yuan-Bentler) test statistic
fit_wc_6_fiml_mr <- cfa(model_wc,           # model formula
                        data = ess_df,      # data frame
                        missing = "direct", # alias: "ml" or "fiml"
                        estimator = "MLR"   # Estimator: ML Robust
)
summary(fit_wc_6_fiml_mr,
fit.measures=TRUE,
standardized=TRUE)
lavaan 0.6-10 ended normally after 45 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 27
Used Total
Number of observations 1758 1760
Number of missing patterns 34
Model Test User Model:
Standard Robust
Test Statistic 75.087 66.620
Degrees of freedom 17 17
P-value (Chi-square) 0.000 0.000
Scaling correction factor 1.127
Yuan-Bentler correction (Mplus variant)
Model Test Baseline Model:
Test statistic 2508.464 1964.310
Degrees of freedom 28 28
P-value 0.000 0.000
Scaling correction factor 1.277
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.977 0.974
Tucker-Lewis Index (TLI) 0.961 0.958
Robust Comparative Fit Index (CFI) 0.977
Robust Tucker-Lewis Index (TLI) 0.963
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -17781.687 -17781.687
Scaling correction factor 1.163
for the MLR correction
Loglikelihood unrestricted model (H1) NA NA
Scaling correction factor 1.149
for the MLR correction
Akaike (AIC) 35617.374 35617.374
Bayesian (BIC) 35765.116 35765.116
Sample-size adjusted Bayesian (BIC) 35679.339 35679.339
Root Mean Square Error of Approximation:
RMSEA 0.044 0.041
90 Percent confidence interval - lower 0.034 0.031
90 Percent confidence interval - upper 0.055 0.051
P-value RMSEA <= 0.05 0.817 0.937
Robust RMSEA 0.043
90 Percent confidence interval - lower 0.033
90 Percent confidence interval - upper 0.054
Standardized Root Mean Square Residual:
SRMR 0.029 0.029
Parameter Estimates:
Standard errors Sandwich
Information bread Observed
Observed information based on Hessian
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
wc_econo =~
sbstrec 1.000 0.622 0.610
sbbsntx 1.011 0.101 10.050 0.000 0.628 0.631
wc_socia =~
sbprvpv 1.000 0.472 0.541
sbeqsoc 1.420 0.253 5.609 0.000 0.670 0.804
sbcwkfm 0.627 0.063 9.900 0.000 0.296 0.381
wc_moral =~
sblazy 1.000 0.747 0.708
sblwcoa 1.065 0.046 23.372 0.000 0.795 0.769
sblwlka 0.977 0.046 21.123 0.000 0.729 0.705
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
wc_econo ~~
wc_socia -0.019 0.017 -1.158 0.247 -0.065 -0.065
wc_moral 0.250 0.025 10.032 0.000 0.539 0.539
wc_socia ~~
wc_moral -0.034 0.013 -2.615 0.009 -0.095 -0.095
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.sbstrec 2.962 0.024 121.142 0.000 2.962 2.907
.sbbsntx 2.535 0.024 105.364 0.000 2.535 2.544
.sbprvpv 2.334 0.021 111.728 0.000 2.334 2.677
.sbeqsoc 2.375 0.020 119.096 0.000 2.375 2.848
.sbcwkfm 2.335 0.019 125.195 0.000 2.335 3.004
.sblazy 2.894 0.025 114.879 0.000 2.894 2.743
.sblwcoa 2.920 0.025 118.149 0.000 2.920 2.821
.sblwlka 2.979 0.025 120.489 0.000 2.979 2.881
wc_econo 0.000 0.000 0.000
wc_socia 0.000 0.000 0.000
wc_moral 0.000 0.000 0.000
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.sbstrec 0.652 0.048 13.681 0.000 0.652 0.628
.sbbsntx 0.598 0.044 13.742 0.000 0.598 0.602
.sbprvpv 0.537 0.044 12.185 0.000 0.537 0.707
.sbeqsoc 0.246 0.078 3.151 0.002 0.246 0.354
.sbcwkfm 0.516 0.026 19.909 0.000 0.516 0.855
.sblazy 0.556 0.032 17.227 0.000 0.556 0.499
.sblwcoa 0.438 0.031 14.278 0.000 0.438 0.409
.sblwlka 0.538 0.032 17.058 0.000 0.538 0.503
wc_econo 0.387 0.048 8.093 0.000 1.000 1.000
wc_socia 0.223 0.045 4.934 0.000 1.000 1.000
wc_moral 0.558 0.038 14.865 0.000 1.000 1.000
FIML yields lower standard errors. Why? Because FIML makes more efficient use of the data at hand and thus preserves statistical power. This is especially true when there is a substantial amount of missing data. Therefore, even if your missingness pattern is MCAR, FIML is a better option than listwise deletion.
Q: Why does combining FIML with robust standard errors lead to an increase in the SEs?
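To see this concretely, we can line up the standard errors of the three fits. A minimal sketch that merges lavaan's parameterEstimates() output on the parameter labels (se_all is an illustrative name; the intercepts, estimated only under FIML, drop out of the inner merge):
# extract parameter labels and standard errors from each fit
se_lw  <- parameterEstimates(fit_wc_6_listwise)[, c("lhs", "op", "rhs", "se")]
se_ml  <- parameterEstimates(fit_wc_6_fiml)[, c("lhs", "op", "rhs", "se")]
se_mlr <- parameterEstimates(fit_wc_6_fiml_mr)[, c("lhs", "op", "rhs", "se")]
se_all <- merge(merge(se_lw, se_ml, by = c("lhs", "op", "rhs"),
                      suffixes = c("_listwise", "_fiml")),
                se_mlr, by = c("lhs", "op", "rhs"))
names(se_all)[names(se_all) == "se"] <- "se_fiml_mlr"
se_all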
6.1 Imputing missing data
The third option is to perform multiple imputation of the missing data. We can use the R package Amelia to perform multiple imputation. Amelia creates a bootstrapped version of the original data and then imputes the missing values of the original data. This involves imputing m values for each missing cell in the data matrix, creating m "completed" data sets. Across these data sets, the observed values are the same, but the missing values are filled in with different imputations that reflect our uncertainty about the missing data.
# select the welfare criticism items
ess_df_wc <- ess_df %>% select(sbstrec, sbbsntx, sbprvpv, sbeqsoc, sbcwkfm, sblazy, sblwcoa, sblwlka)
# Amelia explicitly requires an object of data.frame class
ess_df_wc <- data.frame(ess_df_wc)

a.out <- amelia(ess_df_wc, # original dataset with missing values
                m = 15,    # number of m "completed" data sets
                seed = 23  # set the seed
)
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
1 2 3
-- Imputation 2 --
1 2 3
-- Imputation 3 --
1 2 3
-- Imputation 4 --
1 2 3
-- Imputation 5 --
1 2 3
-- Imputation 6 --
1 2 3
-- Imputation 7 --
1 2 3
-- Imputation 8 --
1 2 3
-- Imputation 9 --
1 2 3
-- Imputation 10 --
1 2 3
-- Imputation 11 --
1 2 3
-- Imputation 12 --
1 2 3
-- Imputation 13 --
1 2 3
-- Imputation 14 --
1 2 3
-- Imputation 15 --
1 2 3
summary(a.out)
Amelia output with 15 imputed datasets.
Return code: 1
Message: Normal EM convergence.
Chain Lengths:
--------------
Imputation 1: 3
Imputation 2: 3
Imputation 3: 3
Imputation 4: 3
Imputation 5: 3
Imputation 6: 3
Imputation 7: 3
Imputation 8: 3
Imputation 9: 3
Imputation 10: 3
Imputation 11: 3
Imputation 12: 3
Imputation 13: 3
Imputation 14: 3
Imputation 15: 3
Rows after Listwise Deletion: 1679
Rows after Imputation: 1758
Patterns of missingness in the data: 35
Fraction Missing for original variables:
-----------------------------------------
Fraction Missing
sbstrec 0.014204545
sbbsntx 0.026136364
sbprvpv 0.011363636
sbeqsoc 0.006818182
sbcwkfm 0.014204545
sblazy 0.005113636
sblwcoa 0.005113636
sblwlka 0.007954545
# we can check each "completed" dataset against our original data
cbind(ess_df_wc$sbstrec, a.out$imputations$imp1$sbstrec)[c(75:85),]
[,1] [,2]
[1,] 4 4.000000
[2,] 4 4.000000
[3,] 2 2.000000
[4,] 2 2.000000
[5,] 3 3.000000
[6,] NA 3.712241
[7,] 2 2.000000
[8,] 4 4.000000
[9,] NA 3.200630
[10,] NA 3.201157
[11,] 3 3.000000
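Amelia also ships graphical diagnostics for the imputation model. For instance, compare.density() overlays the density of the imputed values on that of the observed values; a quick check, assuming the a.out object from above:
# observed vs. imputed densities for one of the items
compare.density(a.out, var = "sbstrec")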
# we need to use semTools to fit a lavaan model to multiple imputed data sets
out.mi <- semTools::runMI(
  model = model_wc,         # model
  data = a.out$imputations, # list of imputed data sets
  fun = "cfa",              # lavaan function
  estimator = "MLR"         # estimator
)
summary(out.mi,
fit.measures=TRUE,
standardized=TRUE)
lavaan.mi object based on 15 imputed data sets.
See class?lavaan.mi help page for available methods.
Convergence information:
The model converged on 15 imputed data sets
Rubin's (1987) rules were used to pool point and SE estimates across 15 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.
Model Test User Model:
Test statistic 75.588 67.286
Degrees of freedom 17 17
P-value 0.000 0.000
Scaling correction factor 1.123
Model Test Baseline Model:
Test statistic 2497.913 1961.807
Degrees of freedom 28 28
P-value 0.000 0.000
Scaling correction factor 1.273
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.976 0.974
Tucker-Lewis Index (TLI) 0.961 0.957
Robust Comparative Fit Index (CFI) 0.977
Robust Tucker-Lewis Index (TLI) 1.000
Root Mean Square Error of Approximation:
RMSEA 0.044 0.041
90 Percent confidence interval - lower 0.034 0.033
90 Percent confidence interval - upper 0.055 0.054
P-value RMSEA <= 0.05 0.809 0.931
Robust RMSEA 0.043
90 Percent confidence interval - lower 0.033
90 Percent confidence interval - upper 0.055
Standardized Root Mean Square Residual:
SRMR 0.032 0.032
Parameter Estimates:
Standard errors Sandwich
Information bread Observed
Observed information based on Hessian
Latent Variables:
Estimate Std.Err t-value df P(>|t|) Std.lv Std.all
wc_econo =~
sbstrec 1.000 0.622 0.610
sbbsntx 1.011 0.099 10.159 Inf 0.000 0.629 0.630
wc_socia =~
sbprvpv 1.000 0.472 0.542
sbeqsoc 1.416 0.254 5.579 Inf 0.000 0.669 0.803
sbcwkfm 0.628 0.063 9.920 Inf 0.000 0.297 0.382
wc_moral =~
sblazy 1.000 0.747 0.708
sblwcoa 1.066 0.046 23.366 Inf 0.000 0.796 0.770
sblwlka 0.975 0.046 21.175 Inf 0.000 0.728 0.704
Covariances:
Estimate Std.Err t-value df P(>|t|) Std.lv Std.all
wc_econo ~~
wc_socia -0.020 0.017 -1.184 Inf 0.236 -0.067 -0.067
wc_moral 0.251 0.025 10.044 Inf 0.000 0.540 0.540
wc_socia ~~
wc_moral -0.034 0.013 -2.645 Inf 0.008 -0.096 -0.096
Variances:
Estimate Std.Err t-value df P(>|t|) Std.lv Std.all
.sbstrec 0.652 0.047 13.818 Inf 0.000 0.652 0.627
.sbbsntx 0.600 0.043 13.880 Inf 0.000 0.600 0.603
.sbprvpv 0.537 0.044 12.142 Inf 0.000 0.537 0.706
.sbeqsoc 0.247 0.078 3.165 Inf 0.002 0.247 0.356
.sbcwkfm 0.515 0.026 19.883 Inf 0.000 0.515 0.854
.sblazy 0.555 0.032 17.216 Inf 0.000 0.555 0.498
.sblwcoa 0.437 0.031 14.214 Inf 0.000 0.437 0.408
.sblwlka 0.540 0.031 17.193 Inf 0.000 0.540 0.505
wc_econo 0.387 0.047 8.184 Inf 0.000 1.000 1.000
wc_socia 0.223 0.045 4.921 Inf 0.000 1.000 1.000
wc_moral 0.558 0.038 14.849 Inf 0.000 1.000 1.000
# let's compare the fit of the different models
model_fit <- function(lavobject) {
  vars <- c("chisq", "df", "cfi", "tli", "rmsea", "rmsea.ci.lower", "rmsea.ci.upper", "rmsea.pvalue", "srmr")
  return(fitmeasures(lavobject)[vars] %>% data.frame() %>% round(2) %>% t())
}

table_fit <- list(model_fit(fit_wc_6_listwise),
                  model_fit(fit_wc_6_fiml),
                  model_fit(out.mi)) %>%
  reduce(rbind)

rownames(table_fit) <- c("Listwise", "FIML", "Amelia")
table_fit
chisq df cfi tli rmsea rmsea.ci.lower rmsea.ci.upper rmsea.pvalue srmr
Listwise 71.35 17 0.98 0.96 0.04 0.03 0.05 0.83 0.03
FIML 75.09 17 0.98 0.96 0.04 0.03 0.05 0.82 0.03
Amelia 75.59 17 0.98 0.96 0.04 0.03 0.05 0.81 0.03
Q: The fit statistics are similar across the models. Why?
Notes:
- Binary variables: you can specify the nominals (noms) option in the amelia() function (see the sketch after these notes)
- Ordinal variables: “Users are advised to allow Amelia to impute non-integer values for any missing data, and to use these non-integer values in their analysis. Sometimes this makes sense, and sometimes this defies intuition.”
- Variables to include in the imputation model: “When performing multiple imputation, the first step is to identify the variables to include in the imputation model. It is crucial to include at least as much information as will be used in the analysis model. That is, any variable that will be in the analysis model should also be in the imputation model. This includes any transformations or interactions of variables that will appear in the analysis model.”
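If you do want Amelia to respect the categorical nature of some variables, nominal and ordinal variables can be declared via the noms and ords arguments of amelia(). A sketch under illustrative assumptions (ess_df_cat and a.out_cat are hypothetical names, and the variable selection is purely for demonstration):
# hypothetical example: treat gndr as nominal and the moral criticism
# items as ordinal, so Amelia imputes valid categories for them
ess_df_cat <- data.frame(ess_df %>%
                           select(gndr, sblazy, sblwcoa, sblwlka, agea, eduyrs))
a.out_cat <- amelia(ess_df_cat,
                    m = 5,                                    # imputed data sets
                    noms = "gndr",                            # nominal variables
                    ords = c("sblazy", "sblwcoa", "sblwlka"), # ordinal variables
                    seed = 23)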