Structural Equation Modeling | Lab Session 4

1 What we are going to cover

  1. Non-normal continuous data
  2. Categorical data
  3. Missing data

2 Data

The data set used throughout is the European Social Survey ESS4-2008, Edition 4.5, released on 1 December 2018. We restrict the analysis to the Belgian case: each line in the data set represents a Belgian respondent. The full dataset and documentation can be found on the ESS website.

Codebook:

  • gvslvol Standard of living for the old, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)

  • gvslvue Standard of living for the unemployed, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)

  • gvhlthc Health care for the sick, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)

  • gvcldcr Child care services for working parents, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)

  • gvjbevn Job for everyone, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)

  • gvpdlwk Paid leave from work to care for sick family, governments’ responsibility (0 Not governments’ responsibility at all - 10 Entirely governments’ responsibility)

  • sbstrec Social benefits/services place too great strain on economy (1 Agree strongly - 5 Disagree strongly)

  • sbbsntx Social benefits/services cost businesses too much in taxes/charges (1 Agree strongly - 5 Disagree strongly)

  • sbprvpv Social benefits/services prevent widespread poverty (1 Agree strongly - 5 Disagree strongly)

  • sbeqsoc Social benefits/services lead to a more equal society (1 Agree strongly - 5 Disagree strongly)

  • sbcwkfm Social benefits/services make it easier to combine work and family (1 Agree strongly - 5 Disagree strongly)

  • sblazy Social benefits/services make people lazy (1 Agree strongly - 5 Disagree strongly)

  • sblwcoa Social benefits/services make people less willing to care for one another (1 Agree strongly - 5 Disagree strongly)

  • sblwlka Social benefits/services make people less willing to look after themselves/family (1 Agree strongly - 5 Disagree strongly)

  • agea Respondent’s age

  • eduyrs Years of full-time education completed

  • gndr Gender (1 Male, 2 Female)

  • hinctnta Household’s total net income, all sources (Deciles of the actual household income range in the given country.)

  • gincdif Government should reduce differences in income levels (1 Agree strongly - 5 Disagree strongly)

  • dfincac Large differences in income acceptable to reward talents and efforts (1 Agree strongly - 5 Disagree strongly)

  • smdfslv For fair society, differences in standard of living should be small (1 Agree strongly - 5 Disagree strongly)

3 Environment preparation

First, let’s load the packages needed to import, manipulate, visualize, and analyse the data.

# Uncomment this once if you need to install the packages on your system 

### DATA MANIPULATION ###
# install.packages("haven")                 # data import from spss
# install.packages("dplyr")                 # data manipulation
# install.packages("psych")                 # descriptives
# install.packages("stringr")               # string manipulation
# install.packages("purrr")                 # table manipulation 

### MODELING ###
# install.packages("lavaan")                # SEM modelling

#### For MVN we need install.packages("gsl"). 
#### On macOS with Homebrew packages:
#### CFLAGS="-I/opt/homebrew/Cellar/gsl/2.7.1/include" LDFLAGS="-L/opt/homebrew/Cellar/gsl/2.7.1/lib -lgsl -lgslcblas" R
#### install.packages("gsl")

# install.packages("MVN")                   # tests for multivariate normality
# install.packages("Amelia")                # performing multiple imputation
# install.packages("semTools")              # fit a model to multiple imputed data set

### VISUALIZATION ###
# install.packages("tidySEM")               # plotting SEM models
# install.packages("ggplot2")               # plotting 
# install.packages("patchwork")             # organization plots

# Load the packages 

### DATA MANIPULATION ###
library("haven")        
library("dplyr")      
library("psych")
library("stringr")
library("purrr")

### MODELING ###
library("lavaan")       
library("MVN")
library("Amelia")
#library("semTools")                      # loaded later for compatibility issues

### VISUALIZATION ###
library("tidySEM")
library("ggplot2")              
library("patchwork")    

4 Non-normal, continuous data

Special data-analytic techniques for non-normal continuous variables should be used if any of the continuous variables in the model are non-normal. Maximum likelihood (ML) estimation with non-normal continuous variables results in biased standard errors and deteriorated model fit. Practically, this means that the chi-square statistic will be inflated and the standard errors too small, increasing the probability of Type I errors. The parameter estimates (e.g., loadings, regression coefficients) are largely unaffected (Finch, West, and MacKinnon 1997).

4.1 Detecting non-normality

The unadjusted ML approach most commonly employed in SEM assumes multivariate normality. A first step in assessing multivariate normality is to check whether the univariate distributions are normal: if any univariate distribution is non-normal, then the multivariate distribution is non-normal as well. We can plot the univariate distributions of our variables and perform a Kolmogorov-Smirnov (KS) test.

ess_df <- haven::read_sav("https://github.com/albertostefanelli/SEM_labs/raw/master/data/ESS4_belgium.sav")

# We are going to measure welfare support using 4 items: gvslvol gvhlthc gvcldcr gvpdlwk
# First let's visually inspect our variables 

h_gvslvol <- ggplot(ess_df, aes(gvslvol)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", alpha=0.3)

h_gvhlthc <- ggplot(ess_df, aes(gvhlthc)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", alpha=0.3)

h_gvcldcr <- ggplot(ess_df, aes(gvcldcr)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", alpha=0.3)

h_gvpdlwk <- ggplot(ess_df, aes(gvpdlwk)) +
        geom_blank() +
        geom_histogram(aes(y = ..density..), binwidth = 1, colour = "black", alpha=0.3)

h_gvslvol + h_gvhlthc + h_gvcldcr + h_gvpdlwk
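
A complementary numeric check (a quick sketch; not part of the original lab code, but the psych package is already loaded): skewness and kurtosis values far from 0 also flag non-normality.

# skewness and kurtosis of the four welfare support items
psych::describe(ess_df[, c("gvslvol", "gvhlthc", "gvcldcr", "gvpdlwk")])[
  , c("mean", "sd", "skew", "kurtosis")]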

We can check whether a distribution is (approximately) normal using the Kolmogorov-Smirnov (KS) test. The KS test returns a D statistic and a p-value corresponding to that statistic. The D statistic is the maximum absolute distance between the empirical cumulative distribution function of our variable and the cumulative distribution function of a normal distribution with the same mean and standard deviation. The closer D is to 0, the more the distribution of our variable resembles a normal distribution. (With discrete 0-10 items the data contain many ties and the normal parameters are estimated from the data, so the p-values are only approximate.)

ks_gvslvol <- ks.test(ess_df$gvslvol, "pnorm", mean=mean(ess_df$gvslvol, na.rm=T), sd=sd(ess_df$gvslvol, na.rm=T))
ks_gvhlthc <- ks.test(ess_df$gvhlthc, "pnorm", mean=mean(ess_df$gvhlthc, na.rm=T), sd=sd(ess_df$gvhlthc, na.rm=T))
ks_gvcldcr <- ks.test(ess_df$gvcldcr, "pnorm", mean=mean(ess_df$gvcldcr, na.rm=T), sd=sd(ess_df$gvcldcr, na.rm=T))
ks_gvpdlwk <- ks.test(ess_df$gvpdlwk, "pnorm", mean=mean(ess_df$gvpdlwk, na.rm=T), sd=sd(ess_df$gvpdlwk, na.rm=T))

data.frame(Variables= c("gvslvol","gvhlthc","gvcldcr", "gvpdlwk"),
          D = round(c(ks_gvslvol$statistic, ks_gvhlthc$statistic, ks_gvcldcr$statistic, ks_gvpdlwk$statistic),2),
          "P-value" = c(ks_gvslvol$p.value, ks_gvhlthc$p.value, ks_gvcldcr$p.value, ks_gvpdlwk$p.value)
           )
  Variables    D P.value
1   gvslvol 0.19       0
2   gvhlthc 0.20       0
3   gvcldcr 0.17       0
4   gvpdlwk 0.17       0

We can also have multivariate non-normality even when all the individual variables are normally distributed (although severe multivariate non-normality is probably less likely in that case). There are different multivariate normality tests (e.g., Mardia, Royston, Doornik-Hansen); we are going to use the Henze-Zirkler multivariate normality test. For multivariate normality, the test p-value should be greater than 0.05.

# select the ws items
ess_df_ws <- ess_df[,c("gvslvol", "gvhlthc", "gvcldcr", "gvpdlwk")]
# remove NAs
ess_df_ws_na <- na.omit(ess_df_ws)

mvn_test <- mvn(data = ess_df_ws_na, # our data without NAs
                mvnTest = c("hz")    # type of normality test to perform
                )

mvn_test$multivariateNormality
           Test       HZ p value MVN
1 Henze-Zirkler 27.21174       0  NO

To mitigate non-normality we can use the scaled \({\chi^2}\) and “robust” standard error corrections to ML estimation of Satorra and Bentler (1988; 1994). Adjustments are made to the \({\chi^2}\) (and the \({\chi^2}\)-based fit indices) and to the standard errors, based on a weight matrix derived from an estimate of multivariate kurtosis (as noted above, the parameter estimates themselves are not altered).

model_ws  <-'
ws =~   gvslvol + # Standard of living for the old
        gvhlthc + # Health care for the sick
        gvcldcr + # Child care services for working parents
        gvpdlwk   # Paid leave from work to care for sick family

'

fit_ws_ml <- cfa(model_ws,          # model formula
                  data = ess_df,    # data frame
                  estimator = "ML"  # standard (unadjusted) maximum likelihood
)

fit_ws_mlr <- cfa(model_ws,          # model formula
                  data = ess_df,     # data frame
                  estimator = "MLM"  # ML with Satorra-Bentler scaled test statistic and robust SEs
)

summary(fit_ws_mlr,
        fit.measures=TRUE)
lavaan 0.6-10 ended normally after 23 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         8
                                                      
                                                  Used       Total
  Number of observations                          1745        1760
                                                                  
Model Test User Model:
                                              Standard      Robust
  Test Statistic                               266.283     106.210
  Degrees of freedom                                 2           2
  P-value (Chi-square)                           0.000       0.000
  Scaling correction factor                                  2.507
       Satorra-Bentler correction                                 

Model Test Baseline Model:

  Test statistic                              1921.621    1147.034
  Degrees of freedom                                 6           6
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.675

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.862       0.909
  Tucker-Lewis Index (TLI)                       0.586       0.726
                                                                  
  Robust Comparative Fit Index (CFI)                         0.863
  Robust Tucker-Lewis Index (TLI)                            0.590

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -12411.000  -12411.000
  Loglikelihood unrestricted model (H1)             NA          NA
                                                                  
  Akaike (AIC)                               24837.999   24837.999
  Bayesian (BIC)                             24881.715   24881.715
  Sample-size adjusted Bayesian (BIC)        24856.300   24856.300

Root Mean Square Error of Approximation:

  RMSEA                                          0.275       0.173
  90 Percent confidence interval - lower         0.248       0.155
  90 Percent confidence interval - upper         0.304       0.191
  P-value RMSEA <= 0.05                          0.000       0.000
                                                                  
  Robust RMSEA                                               0.274
  90 Percent confidence interval - lower                     0.231
  90 Percent confidence interval - upper                     0.319

Standardized Root Mean Square Residual:

  SRMR                                           0.079       0.079

Parameter Estimates:

  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  ws =~                                               
    gvslvol           1.000                           
    gvhlthc           0.921    0.044   20.858    0.000
    gvcldcr           0.852    0.052   16.287    0.000
    gvpdlwk           0.876    0.051   17.080    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .gvslvol           0.846    0.080   10.619    0.000
   .gvhlthc           1.024    0.084   12.134    0.000
   .gvcldcr           2.126    0.134   15.832    0.000
   .gvpdlwk           2.055    0.138   14.915    0.000
    ws                1.350    0.103   13.124    0.000
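
To see what the correction buys us, we can place the naive and the scaled/robust fit measures side by side; a minimal sketch using lavaan’s fitMeasures():

# naive ML fit measures
fitMeasures(fit_ws_ml, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))
# scaled/robust counterparts from the MLM fit
fitMeasures(fit_ws_mlr, c("chisq.scaled", "df.scaled", "pvalue.scaled",
                          "cfi.robust", "tli.robust", "rmsea.robust", "srmr"))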

5 Categorical data

  1. binary: male/female, dead/alive
  2. nominal with \(K>2\): race (black, latino, asian)
  3. ordinal: ses (high, middle, low), likert scales (agree strongly, agree, neutral, disagree, disagree strongly)
  4. counts: number of deadly accidents in a year

lavaan can deal only with binary and ordinal endogenous variables. There are two approaches to categorical variables in SEM. Only the three-stage WLS approach is well implemented in lavaan, including “robust” variants (the alternative is a full-information approach whose estimator is referred to as marginal maximum likelihood). Let’s use the social (sbprvpv, sbeqsoc, sbcwkfm) and moral (sblazy, sblwcoa, sblwlka) criticism items to estimate a model with two first-order latent variables measured by ordinal indicators.

Stages of WLS:

  1. The observed ordinal variable \(y\) is treated as a coarse categorization of a latent continuous variable \(y^*\); in our social criticism example, each item is an ordinal variable with K = 5 response categories. The model estimates the thresholds of \(y^*\) using univariate information only (see the sketch after this list). This is part of the mean structure of the model.
  2. Keeping the thresholds fixed, the model estimates the correlations between the \(y^*\) variables. Only the correlation matrix is used, rather than the covariance matrix. In our example with ordinal data, the model estimates polychoric correlations.
  3. The full SEM model is estimated using weighted least squares. If exogenous covariates are involved (i.e., regression paths), the correlations are adjusted based on the values of \(y^*\). The residual variances of \(y^*\) are estimated (the so-called “scale parameters”).
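
To build intuition for stage 1, a minimal sketch (assuming ess_df is still loaded): the thresholds are just standard normal quantiles of the cumulative response proportions, and should closely match the sbprvpv|t1 ... sbprvpv|t4 values in the output below.

# stage 1 by hand: thresholds from univariate cumulative proportions
p_cum <- cumsum(prop.table(table(ess_df$sbprvpv)))  # cumulative proportions
qnorm(p_cum[-length(p_cum)])                        # K - 1 = 4 thresholds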

model_wc <-'
## Social criticism ## 
wc_socia =~ sbprvpv + sbeqsoc + sbcwkfm
##  Moral criticism ##
wc_moral =~ sblazy + sblwcoa + sblwlka
'

fit_wc_wlsmv <- cfa(model_wc, 
                    ordered = c("sbprvpv",
                                "sbeqsoc",
                                "sbcwkfm",
                                "sblazy",
                                "sblwcoa",
                                "sblwlka" ),
                    data = ess_df,
                    estimator = "WLSMV"
)
summary(fit_wc_wlsmv, standardized=TRUE)
lavaan 0.6-10 ended normally after 23 iterations

  Estimator                                       DWLS
  Optimization method                           NLMINB
  Number of model parameters                        31
                                                      
                                                  Used       Total
  Number of observations                          1711        1760
                                                                  
Model Test User Model:
                                              Standard      Robust
  Test Statistic                                21.993      29.388
  Degrees of freedom                                 8           8
  P-value (Chi-square)                           0.005       0.000
  Scaling correction factor                                  0.783
  Shift parameter                                            1.287
       simple second-order correction                             

Parameter Estimates:

  Standard errors                           Robust.sem
  Information                                 Expected
  Information saturated (h1) model        Unstructured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  wc_socia =~                                                           
    sbprvpv           1.000                               0.610    0.610
    sbeqsoc           1.443    0.120   11.979    0.000    0.880    0.880
    sbcwkfm           0.691    0.044   15.740    0.000    0.422    0.422
  wc_moral =~                                                           
    sblazy            1.000                               0.736    0.736
    sblwcoa           1.132    0.027   42.700    0.000    0.833    0.833
    sblwlka           1.015    0.024   42.012    0.000    0.747    0.747

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  wc_socia ~~                                                           
    wc_moral         -0.041    0.014   -2.890    0.004   -0.090   -0.090

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .sbprvpv           0.000                               0.000    0.000
   .sbeqsoc           0.000                               0.000    0.000
   .sbcwkfm           0.000                               0.000    0.000
   .sblazy            0.000                               0.000    0.000
   .sblwcoa           0.000                               0.000    0.000
   .sblwlka           0.000                               0.000    0.000
    wc_socia          0.000                               0.000    0.000
    wc_moral          0.000                               0.000    0.000

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    sbprvpv|t1       -1.221    0.040  -30.423    0.000   -1.221   -1.221
    sbprvpv|t2        0.528    0.032   16.562    0.000    0.528    0.528
    sbprvpv|t3        1.111    0.038   29.093    0.000    1.111    1.111
    sbprvpv|t4        2.213    0.081   27.384    0.000    2.213    2.213
    sbeqsoc|t1       -1.420    0.045  -31.912    0.000   -1.420   -1.420
    sbeqsoc|t2        0.487    0.032   15.375    0.000    0.487    0.487
    sbeqsoc|t3        1.150    0.039   29.605    0.000    1.150    1.150
    sbeqsoc|t4        2.165    0.077   28.025    0.000    2.165    2.165
    sbcwkfm|t1       -1.462    0.046  -32.072    0.000   -1.462   -1.462
    sbcwkfm|t2        0.538    0.032   16.846    0.000    0.538    0.538
    sbcwkfm|t3        1.279    0.041   30.976    0.000    1.279    1.279
    sbcwkfm|t4        2.375    0.095   25.029    0.000    2.375    2.375
    sblazy|t1        -1.458    0.045  -32.058    0.000   -1.458   -1.458
    sblazy|t2        -0.204    0.031   -6.690    0.000   -0.204   -0.204
    sblazy|t3         0.428    0.031   13.659    0.000    0.428    0.428
    sblazy|t4         1.631    0.051   32.211    0.000    1.631    1.631
    sblwcoa|t1       -1.573    0.049  -32.255    0.000   -1.573   -1.573
    sblwcoa|t2       -0.198    0.031   -6.497    0.000   -0.198   -0.198
    sblwcoa|t3        0.355    0.031   11.455    0.000    0.355    0.355
    sblwcoa|t4        1.747    0.055   31.853    0.000    1.747    1.747
    sblwlka|t1       -1.548    0.048  -32.245    0.000   -1.548   -1.548
    sblwlka|t2       -0.275    0.031   -8.955    0.000   -0.275   -0.275
    sblwlka|t3        0.272    0.031    8.859    0.000    0.272    0.272
    sblwlka|t4        1.804    0.057   31.554    0.000    1.804    1.804

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .sbprvpv           0.628                               0.628    0.628
   .sbeqsoc           0.225                               0.225    0.225
   .sbcwkfm           0.822                               0.822    0.822
   .sblazy            0.459                               0.459    0.459
   .sblwcoa           0.306                               0.306    0.306
   .sblwlka           0.442                               0.442    0.442
    wc_socia          0.372    0.036   10.462    0.000    1.000    1.000
    wc_moral          0.541    0.020   27.352    0.000    1.000    1.000

Scales y*:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
    sbprvpv           1.000                               1.000    1.000
    sbeqsoc           1.000                               1.000    1.000
    sbcwkfm           1.000                               1.000    1.000
    sblazy            1.000                               1.000    1.000
    sblwcoa           1.000                               1.000    1.000
    sblwlka           1.000                               1.000    1.000
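
The scaled fit indices for the WLSMV fit can be extracted in the usual way; a short sketch:

# scaled test-based fit indices for the categorical model
fitMeasures(fit_wc_wlsmv, c("chisq.scaled", "df.scaled", "pvalue.scaled",
                            "cfi.scaled", "tli.scaled", "rmsea.scaled", "srmr"))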

6 Missing data

Different types of missingness:

  1. Missing completely at random (MCAR): the events that lead to any particular data item being missing are independent both of observable variables and of unobservable parameters of interest, and occur entirely at random. When data are MCAR, the analyses performed on the data are unbiased; however, data are rarely MCAR.
  2. Missing at random (MAR): the missingness is not random, but can be fully accounted for by variables on which there is complete information. MAR is an assumption that is impossible to verify statistically; we must rely on its substantive reasonableness.
  3. Missing not at random (MNAR), also known as nonignorable nonresponse: the data are neither MAR nor MCAR (i.e., the value of the missing variable is related to the reason it is missing).

Three different approaches to missing data (see Brown, Chapter 9):

  1. Default behaviour: listwise deletion
  2. Best option: Case-wise (or ‘full information’) maximum likelihood (FIML). Can be employed ONLY when the missing mechanism is MCAR or MAR. A scaled (‘Yuan-Bentler’) test statistic similar to the one used for non-normal continuous data can be computed.
  3. Second best option: Multiple imputation
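
Before choosing an approach, it helps to quantify the missingness in the items we are about to model; a quick base-R sketch:

# how much missingness are we dealing with?
wc_items <- c("sbstrec", "sbbsntx", "sbprvpv", "sbeqsoc",
              "sbcwkfm", "sblazy", "sblwcoa", "sblwlka")
colSums(is.na(ess_df[, wc_items]))          # missing values per item
mean(!complete.cases(ess_df[, wc_items]))   # share of incomplete rows
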
model_wc <-'
## Economic criticism ##
wc_econo =~ sbstrec + sbbsntx
## Social criticism ## 
wc_socia =~ sbprvpv + sbeqsoc + sbcwkfm
##  Moral criticism ##
wc_moral =~ sblazy + sblwcoa + sblwlka
'

# listwise

fit_wc_6_listwise <- cfa(model_wc,               # model formula
                        data = ess_df,           # data frame
                        missing = "listwise"     # listwise
  )

summary(fit_wc_6_listwise, standardized=TRUE)
lavaan 0.6-10 ended normally after 36 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        19
                                                      
                                                  Used       Total
  Number of observations                          1679        1760
                                                                  
Model Test User Model:
                                                      
  Test statistic                                71.348
  Degrees of freedom                                17
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  wc_econo =~                                                           
    sbstrec           1.000                               0.628    0.616
    sbbsntx           1.001    0.093   10.817    0.000    0.629    0.629
  wc_socia =~                                                           
    sbprvpv           1.000                               0.462    0.527
    sbeqsoc           1.493    0.166    9.011    0.000    0.689    0.824
    sbcwkfm           0.631    0.055   11.461    0.000    0.291    0.375
  wc_moral =~                                                           
    sblazy            1.000                               0.749    0.710
    sblwcoa           1.058    0.045   23.285    0.000    0.792    0.766
    sblwlka           0.977    0.043   22.843    0.000    0.732    0.706

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  wc_econo ~~                                                           
    wc_socia         -0.024    0.011   -2.093    0.036   -0.082   -0.082
    wc_moral          0.250    0.023   10.872    0.000    0.532    0.532
  wc_socia ~~                                                           
    wc_moral         -0.034    0.012   -2.915    0.004   -0.099   -0.099

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .sbstrec           0.647    0.042   15.381    0.000    0.647    0.621
   .sbbsntx           0.603    0.041   14.584    0.000    0.603    0.604
   .sbprvpv           0.554    0.030   18.722    0.000    0.554    0.722
   .sbeqsoc           0.224    0.051    4.404    0.000    0.224    0.320
   .sbcwkfm           0.518    0.020   25.832    0.000    0.518    0.859
   .sblazy            0.551    0.027   20.083    0.000    0.551    0.495
   .sblwcoa           0.441    0.027   16.639    0.000    0.441    0.413
   .sblwlka           0.539    0.027   20.320    0.000    0.539    0.502
    wc_econo          0.395    0.045    8.689    0.000    1.000    1.000
    wc_socia          0.213    0.029    7.328    0.000    1.000    1.000
    wc_moral          0.561    0.039   14.477    0.000    1.000    1.000

# Full information maximum likelihood (FIML)

fit_wc_6_fiml <- cfa(model_wc,                # model formula
                     data = ess_df,           # data frame
                     missing = "direct"       # alias: "ml" or "fiml"
  )

summary(fit_wc_6_fiml, standardized=TRUE)
lavaan 0.6-10 ended normally after 45 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        27
                                                      
                                                  Used       Total
  Number of observations                          1758        1760
  Number of missing patterns                        34            
                                                                  
Model Test User Model:
                                                      
  Test statistic                                75.087
  Degrees of freedom                                17
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Observed
  Observed information based on                Hessian

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  wc_econo =~                                                           
    sbstrec           1.000                               0.622    0.610
    sbbsntx           1.011    0.092   10.931    0.000    0.628    0.631
  wc_socia =~                                                           
    sbprvpv           1.000                               0.472    0.541
    sbeqsoc           1.420    0.172    8.270    0.000    0.670    0.804
    sbcwkfm           0.627    0.053   11.752    0.000    0.296    0.381
  wc_moral =~                                                           
    sblazy            1.000                               0.747    0.708
    sblwcoa           1.065    0.045   23.481    0.000    0.795    0.769
    sblwlka           0.977    0.043   22.799    0.000    0.729    0.705

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  wc_econo ~~                                                           
    wc_socia         -0.019    0.012   -1.592    0.111   -0.065   -0.065
    wc_moral          0.250    0.023   10.878    0.000    0.539    0.539
  wc_socia ~~                                                           
    wc_moral         -0.034    0.012   -2.900    0.004   -0.095   -0.095

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .sbstrec           2.962    0.024  121.159    0.000    2.962    2.907
   .sbbsntx           2.535    0.024  105.527    0.000    2.535    2.544
   .sbprvpv           2.334    0.021  111.740    0.000    2.334    2.677
   .sbeqsoc           2.375    0.020  119.100    0.000    2.375    2.848
   .sbcwkfm           2.335    0.019  125.202    0.000    2.335    3.004
   .sblazy            2.894    0.025  114.871    0.000    2.894    2.743
   .sblwcoa           2.920    0.025  118.148    0.000    2.920    2.821
   .sblwlka           2.979    0.025  120.502    0.000    2.979    2.881
    wc_econo          0.000                               0.000    0.000
    wc_socia          0.000                               0.000    0.000
    wc_moral          0.000                               0.000    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .sbstrec           0.652    0.041   15.866    0.000    0.652    0.628
   .sbbsntx           0.598    0.041   14.681    0.000    0.598    0.602
   .sbprvpv           0.537    0.032   16.919    0.000    0.537    0.707
   .sbeqsoc           0.246    0.053    4.611    0.000    0.246    0.354
   .sbcwkfm           0.516    0.020   25.386    0.000    0.516    0.855
   .sblazy            0.556    0.027   20.343    0.000    0.556    0.499
   .sblwcoa           0.438    0.026   16.852    0.000    0.438    0.409
   .sblwlka           0.538    0.026   20.842    0.000    0.538    0.503
    wc_econo          0.387    0.044    8.737    0.000    1.000    1.000
    wc_socia          0.223    0.032    7.019    0.000    1.000    1.000
    wc_moral          0.558    0.038   14.626    0.000    1.000    1.000

# combine robust standard errors (sandwich) with a scaled test statistic (Yuan-Bentler)

fit_wc_6_fiml_mr <- cfa(model_wc,                # model formula
                     data = ess_df,              # data frame
                     missing = "direct",         # alias: "ml" or "fiml"
                     estimator = "MLR"           # Estimator: ML Robust 
  )

summary(fit_wc_6_fiml_mr, 
        fit.measures=TRUE, 
        standardized=TRUE)
lavaan 0.6-10 ended normally after 45 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        27
                                                      
                                                  Used       Total
  Number of observations                          1758        1760
  Number of missing patterns                        34            
                                                                  
Model Test User Model:
                                               Standard      Robust
  Test Statistic                                 75.087      66.620
  Degrees of freedom                                 17          17
  P-value (Chi-square)                            0.000       0.000
  Scaling correction factor                                   1.127
       Yuan-Bentler correction (Mplus variant)                     

Model Test Baseline Model:

  Test statistic                              2508.464    1964.310
  Degrees of freedom                                28          28
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.277

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.977       0.974
  Tucker-Lewis Index (TLI)                       0.961       0.958
                                                                  
  Robust Comparative Fit Index (CFI)                         0.977
  Robust Tucker-Lewis Index (TLI)                            0.963

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -17781.687  -17781.687
  Scaling correction factor                                  1.163
      for the MLR correction                                      
  Loglikelihood unrestricted model (H1)             NA          NA
  Scaling correction factor                                  1.149
      for the MLR correction                                      
                                                                  
  Akaike (AIC)                               35617.374   35617.374
  Bayesian (BIC)                             35765.116   35765.116
  Sample-size adjusted Bayesian (BIC)        35679.339   35679.339

Root Mean Square Error of Approximation:

  RMSEA                                          0.044       0.041
  90 Percent confidence interval - lower         0.034       0.031
  90 Percent confidence interval - upper         0.055       0.051
  P-value RMSEA <= 0.05                          0.817       0.937
                                                                  
  Robust RMSEA                                               0.043
  90 Percent confidence interval - lower                     0.033
  90 Percent confidence interval - upper                     0.054

Standardized Root Mean Square Residual:

  SRMR                                           0.029       0.029

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  wc_econo =~                                                           
    sbstrec           1.000                               0.622    0.610
    sbbsntx           1.011    0.101   10.050    0.000    0.628    0.631
  wc_socia =~                                                           
    sbprvpv           1.000                               0.472    0.541
    sbeqsoc           1.420    0.253    5.609    0.000    0.670    0.804
    sbcwkfm           0.627    0.063    9.900    0.000    0.296    0.381
  wc_moral =~                                                           
    sblazy            1.000                               0.747    0.708
    sblwcoa           1.065    0.046   23.372    0.000    0.795    0.769
    sblwlka           0.977    0.046   21.123    0.000    0.729    0.705

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  wc_econo ~~                                                           
    wc_socia         -0.019    0.017   -1.158    0.247   -0.065   -0.065
    wc_moral          0.250    0.025   10.032    0.000    0.539    0.539
  wc_socia ~~                                                           
    wc_moral         -0.034    0.013   -2.615    0.009   -0.095   -0.095

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .sbstrec           2.962    0.024  121.142    0.000    2.962    2.907
   .sbbsntx           2.535    0.024  105.364    0.000    2.535    2.544
   .sbprvpv           2.334    0.021  111.728    0.000    2.334    2.677
   .sbeqsoc           2.375    0.020  119.096    0.000    2.375    2.848
   .sbcwkfm           2.335    0.019  125.195    0.000    2.335    3.004
   .sblazy            2.894    0.025  114.879    0.000    2.894    2.743
   .sblwcoa           2.920    0.025  118.149    0.000    2.920    2.821
   .sblwlka           2.979    0.025  120.489    0.000    2.979    2.881
    wc_econo          0.000                               0.000    0.000
    wc_socia          0.000                               0.000    0.000
    wc_moral          0.000                               0.000    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .sbstrec           0.652    0.048   13.681    0.000    0.652    0.628
   .sbbsntx           0.598    0.044   13.742    0.000    0.598    0.602
   .sbprvpv           0.537    0.044   12.185    0.000    0.537    0.707
   .sbeqsoc           0.246    0.078    3.151    0.002    0.246    0.354
   .sbcwkfm           0.516    0.026   19.909    0.000    0.516    0.855
   .sblazy            0.556    0.032   17.227    0.000    0.556    0.499
   .sblwcoa           0.438    0.031   14.278    0.000    0.438    0.409
   .sblwlka           0.538    0.032   17.058    0.000    0.538    0.503
    wc_econo          0.387    0.048    8.093    0.000    1.000    1.000
    wc_socia          0.223    0.045    4.934    0.000    1.000    1.000
    wc_moral          0.558    0.038   14.865    0.000    1.000    1.000

FIML yields lower standard errors. Why? Because FIML makes more efficient use of the data at hand and thus preserves statistical power. This is especially true when there is a substantial amount of missingness in the data. Therefore, even if your missing data mechanism is MCAR, FIML is a better option than listwise deletion.
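
To verify this directly, we can line up the standard errors of the two fits; a sketch (the FIML model additionally estimates intercepts, hence the merge on the parameter labels):

# side-by-side standard errors: listwise vs. FIML
pe_lw   <- parameterEstimates(fit_wc_6_listwise)[, c("lhs", "op", "rhs", "se")]
pe_fiml <- parameterEstimates(fit_wc_6_fiml)[, c("lhs", "op", "rhs", "se")]
merge(pe_lw, pe_fiml, by = c("lhs", "op", "rhs"),
      suffixes = c(".listwise", ".fiml"))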

Q: Why does combining FIML with robust standard errors lead to an increase in the SEs?

6.1 Imputing missing data

The third option is to perform multiple imputation of the missing data. We can use the R package Amelia to perform the imputations. Amelia creates bootstrapped versions of the original data and then imputes the missing values of the original data. This involves imputing m values for each missing cell in the data matrix, creating m “completed” data sets. Across these data sets, the observed values are the same, but the missing values are filled in with different imputations that reflect our uncertainty about the missing data.

# select the welfare criticism items
ess_df_wc <- ess_df %>% select(sbstrec, sbbsntx, sbprvpv, sbeqsoc, sbcwkfm, sblazy, sblwcoa, sblwlka)
# Amelia explicitly requires an object of data.frame class 
ess_df_wc <- data.frame(ess_df_wc)

a.out <- amelia(ess_df_wc,  # original dataset with missing
                m = 15,     # number of m "completed" data sets 
                seed = 23   # set the seed
                )
Warning: There are observations in the data that are completely missing. 
         These observations will remain unimputed in the final datasets. 
-- Imputation 1 --

  1  2  3

-- Imputation 2 --

  1  2  3

-- Imputation 3 --

  1  2  3

-- Imputation 4 --

  1  2  3

-- Imputation 5 --

  1  2  3

-- Imputation 6 --

  1  2  3

-- Imputation 7 --

  1  2  3

-- Imputation 8 --

  1  2  3

-- Imputation 9 --

  1  2  3

-- Imputation 10 --

  1  2  3

-- Imputation 11 --

  1  2  3

-- Imputation 12 --

  1  2  3

-- Imputation 13 --

  1  2  3

-- Imputation 14 --

  1  2  3

-- Imputation 15 --

  1  2  3
summary(a.out)

Amelia output with 15 imputed datasets.
Return code:  1 
Message:  Normal EM convergence. 

Chain Lengths:
--------------
Imputation 1:  3
Imputation 2:  3
Imputation 3:  3
Imputation 4:  3
Imputation 5:  3
Imputation 6:  3
Imputation 7:  3
Imputation 8:  3
Imputation 9:  3
Imputation 10:  3
Imputation 11:  3
Imputation 12:  3
Imputation 13:  3
Imputation 14:  3
Imputation 15:  3

Rows after Listwise Deletion:  1679 
Rows after Imputation:  1758 
Patterns of missingness in the data:  35 

Fraction Missing for original variables: 
-----------------------------------------

        Fraction Missing
sbstrec      0.014204545
sbbsntx      0.026136364
sbprvpv      0.011363636
sbeqsoc      0.006818182
sbcwkfm      0.014204545
sblazy       0.005113636
sblwcoa      0.005113636
sblwlka      0.007954545

# we can check each "completed" dataset against our original data 
cbind(ess_df_wc$sbstrec, a.out$imputations$imp1$sbstrec)[c(75:85),]
      [,1]     [,2]
 [1,]    4 4.000000
 [2,]    4 4.000000
 [3,]    2 2.000000
 [4,]    2 2.000000
 [5,]    3 3.000000
 [6,]   NA 3.712241
 [7,]    2 2.000000
 [8,]    4 4.000000
 [9,]   NA 3.200630
[10,]   NA 3.201157
[11,]    3 3.000000

# we need to use semTools to fit a lavaan model to multiply imputed data sets

out.mi <- semTools::runMI(
              model = model_wc,         # model 
              data = a.out$imputations, # list of imputed data sets 
              fun = "cfa",              # lavaan function 
              estimator = "MLR"         # estimator
              )


summary(out.mi,
        fit.measures=TRUE,
        standardized=TRUE)
lavaan.mi object based on 15 imputed data sets. 
See class?lavaan.mi help page for available methods. 

Convergence information:
The model converged on 15 imputed data sets 

Rubin's (1987) rules were used to pool point and SE estimates across 15 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.

Model Test User Model:

  Test statistic                                75.588      67.286
  Degrees of freedom                                17          17
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.123

Model Test Baseline Model:

  Test statistic                              2497.913    1961.807
  Degrees of freedom                                28          28
  P-value                                        0.000       0.000
  Scaling correction factor                                  1.273

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.976       0.974
  Tucker-Lewis Index (TLI)                       0.961       0.957
                                                                  
  Robust Comparative Fit Index (CFI)                         0.977
  Robust Tucker-Lewis Index (TLI)                            1.000

Root Mean Square Error of Approximation:

  RMSEA                                          0.044       0.041
  90 Percent confidence interval - lower         0.034       0.033
  90 Percent confidence interval - upper         0.055       0.054
  P-value RMSEA <= 0.05                          0.809       0.931
                                                                  
  Robust RMSEA                                               0.043
  90 Percent confidence interval - lower                     0.033
  90 Percent confidence interval - upper                     0.055

Standardized Root Mean Square Residual:

  SRMR                                           0.032       0.032

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Latent Variables:
                   Estimate  Std.Err  t-value       df  P(>|t|)   Std.lv  Std.all
  wc_econo =~                                                                    
    sbstrec           1.000                                        0.622    0.610
    sbbsntx           1.011    0.099   10.159      Inf    0.000    0.629    0.630
  wc_socia =~                                                                    
    sbprvpv           1.000                                        0.472    0.542
    sbeqsoc           1.416    0.254    5.579      Inf    0.000    0.669    0.803
    sbcwkfm           0.628    0.063    9.920      Inf    0.000    0.297    0.382
  wc_moral =~                                                                    
    sblazy            1.000                                        0.747    0.708
    sblwcoa           1.066    0.046   23.366      Inf    0.000    0.796    0.770
    sblwlka           0.975    0.046   21.175      Inf    0.000    0.728    0.704

Covariances:
                   Estimate  Std.Err  t-value       df  P(>|t|)   Std.lv  Std.all
  wc_econo ~~                                                                    
    wc_socia         -0.020    0.017   -1.184      Inf    0.236   -0.067   -0.067
    wc_moral          0.251    0.025   10.044      Inf    0.000    0.540    0.540
  wc_socia ~~                                                                    
    wc_moral         -0.034    0.013   -2.645      Inf    0.008   -0.096   -0.096

Variances:
                   Estimate  Std.Err  t-value       df  P(>|t|)   Std.lv  Std.all
   .sbstrec           0.652    0.047   13.818      Inf    0.000    0.652    0.627
   .sbbsntx           0.600    0.043   13.880      Inf    0.000    0.600    0.603
   .sbprvpv           0.537    0.044   12.142      Inf    0.000    0.537    0.706
   .sbeqsoc           0.247    0.078    3.165      Inf    0.002    0.247    0.356
   .sbcwkfm           0.515    0.026   19.883      Inf    0.000    0.515    0.854
   .sblazy            0.555    0.032   17.216      Inf    0.000    0.555    0.498
   .sblwcoa           0.437    0.031   14.214      Inf    0.000    0.437    0.408
   .sblwlka           0.540    0.031   17.193      Inf    0.000    0.540    0.505
    wc_econo          0.387    0.047    8.184      Inf    0.000    1.000    1.000
    wc_socia          0.223    0.045    4.921      Inf    0.000    1.000    1.000
    wc_moral          0.558    0.038   14.849      Inf    0.000    1.000    1.000
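
For reference, Rubin’s rules pool the \(m\) point estimates by simple averaging, \(\bar{\theta} = \frac{1}{m}\sum_{i=1}^{m}\hat{\theta}_i\), and pool the uncertainty as \(V = \bar{W} + \left(1+\frac{1}{m}\right)B\), where \(\bar{W}\) is the average within-imputation sampling variance and \(B\) is the between-imputation variance of the estimates.
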
# let's compare the fit of the different models
model_fit <-  function(lavobject) {
  vars <- c("chisq", "df", "cfi", "tli", "rmsea", "rmsea.ci.lower", "rmsea.ci.upper", "rmsea.pvalue", "srmr")
  return(fitmeasures(lavobject)[vars] %>% data.frame() %>% round(2) %>% t())
}

table_fit <- 
  list(model_fit(fit_wc_6_listwise), 
       model_fit(fit_wc_6_fiml), 
       model_fit(out.mi)) %>% 
  reduce(rbind)
 

rownames(table_fit) <- c("Listwise", "FIML", "Amelia")

table_fit
         chisq df  cfi  tli rmsea rmsea.ci.lower rmsea.ci.upper rmsea.pvalue srmr
Listwise 71.35 17 0.98 0.96  0.04           0.03           0.05         0.83 0.03
FIML     75.09 17 0.98 0.96  0.04           0.03           0.05         0.82 0.03
Amelia   75.59 17 0.98 0.96  0.04           0.03           0.05         0.81 0.03

Q: The fit statistics are similar across the models. Why?

Notes:

  1. Binary variables: you can declare them via the noms argument of the amelia() function (see the sketch after this list)
  2. Ordinal variables: “Users are advised to allow Amelia to impute non-integer values for any missing data, and to use these non-integer values in their analysis. Sometimes this makes sense, and sometimes this defies intuition.”
  3. Variables to include in the imputation model: “When performing multiple imputation, the first step is to identify the variables to include in the imputation model. It is crucial to include at least as much information as will be used in the analysis model. That is, any variable that will be in the analysis model should also be in the imputation model. This includes any transformations or interactions of variables that will appear in the analysis model.”
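
A hypothetical sketch of note 1 (the variable selection here is illustrative, not part of the lab’s imputation model): in amelia(), nominal or binary variables go in the noms argument, and ordinal variables can optionally be listed in ords to restrict imputations to the observed categories.

# declaring variable types in amelia() -- illustrative only
ess_df_mix <- data.frame(ess_df[, c("sbstrec", "sblazy", "gndr")])
a.out_mix  <- amelia(ess_df_mix,
                     m    = 5,
                     noms = "gndr",                  # binary/nominal variable
                     ords = c("sbstrec", "sblazy"),  # optional; see note 2 above
                     seed = 23)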

7 !!Support Ukraine!!

References

Finch, John F., Stephen G. West, and David P. MacKinnon. 1997. “Effects of Sample Size and Nonnormality on the Estimation of Mediated Effects in Latent Variable Models.” Structural Equation Modeling: A Multidisciplinary Journal 4 (2): 87–107. https://doi.org/10.1080/10705519709540063.