Introduction to Structural Equation Modeling Using Stata
Chuck Huber
StataCorp
chuber@stata.com
University College London
October 16, 2019

Outline
What is structural equation modeling?
Structural equation modeling in Stata
Continuous outcome models using sem
Multilevel generalized models using gsem
Demonstrations and Questions

What is Structural Equation Modeling?
Brief history
Path diagrams
Key concepts, jargon and assumptions
Assessing model fit
The process of SEM

Brief History of SEM
Factor analysis had its roots in psychology.
– Charles Spearman (1904) is credited with developing the common factor model. He proposed that correlations between tests of mental abilities could be explained by a common factor representing ability.
– In the 1930s, L. L. Thurstone, who was also active in psychometrics, presented work on multiple factor models. He disagreed with the idea of one general intelligence factor underlying all test scores. He also used an oblique rotation, allowing the factors to be correlated.
– In 1956, T. W. Anderson and H. Rubin discussed testing in factor analysis, and Jöreskog (1969) introduced confirmatory factor analysis and estimation via maximum likelihood, allowing for testing of hypotheses about the number of factors and how they relate to observed variables.

Brief History of SEM
Path analysis and systems of simultaneous equations developed in genetics, econometrics, and later sociology.
– Sewall Wright, a geneticist, is credited with developing path analysis. His first paper using this method was published in 1918, where he looked at genetic causes related to bone sizes in rabbits. Rather than estimating only the correlation between variables, he created path diagrams that showed presumed causal paths between variables. To evaluate his assumptions, he compared the observed correlations with what the correlations should be if the variables had the presumed relationships.
– In the 1930s, 1940s, and 1950s, many economists including Haavelmo (1943) and Koopmans (1945) worked with systems of simultaneous equations. Economists also introduced a variety of estimation methods and investigated identification issues.
– In the 1960s, sociologists including Blalock and Duncan applied path analysis to their research.

Brief History of SEM
In the early 1970s, these two methods merged.
– Hauser and Goldberger (1971) worked on including unobservables in path models.
– Jöreskog (1973) developed a general model for fitting systems of linear equations and for including latent variables. He also developed the methodology for fitting these models using maximum likelihood estimation and created the program LISREL.
– Keesling (1972) and Wiley (1973) also worked with the general framework combining the two methods.
Much work has been done since then to extend these models, to evaluate identification, to test model fit, and more.

What is Structural Equation Modeling?
Structural equation modeling encompasses a broad array of models from linear regression to measurement models to simultaneous equations.
Structural equation modeling is not just an estimation method for a particular model.
Structural equation modeling is a way of thinking, a way of writing, and a way of estimating.
(Stata SEM Manual, p. 2)

What is Structural Equation Modeling?
SEM is a class of statistical techniques that allows us to test hypotheses about relationships among variables.
SEM may also be referred to as Analysis of Covariance Structures. SEM fits models using the observed covariances and, possibly, means.
SEM encompasses other statistical methods such as correlation, linear regression, and factor analysis.
SEM is a multivariate technique that allows us to estimate a system of equations. Variables in these equations may be measured with error. There may be variables in the model that cannot be measured directly.

Structural Equation Models are often drawn as Path Diagrams
[Figure: path diagram]

Jargon
Observed and Latent variables
Paths and Covariance
Endogenous and Exogenous variables
Recursive and Nonrecursive models

Observed and Latent Variables
Observed variables are variables that are included in our dataset. They are represented by rectangles. The variables satv, satq, and hsgpa are observed variables in this path diagram.
Latent variables are unobserved variables that we wish we had observed. They can be thought of as the underlying cause of the observed variables. They are represented by ovals. The variable Aptitude is a latent variable in this path diagram.

Paths and Covariance
Paths are direct relationships between variables. Estimated path coefficients are analogous to regression coefficients. They are represented by straight arrows.
Covariances specify that two latent variables or error terms covary. They are represented by curved arrows.

Exogenous and Endogenous Variables
Exogenous variables are determined outside the system of equations. No paths point to them. The variables satv, satq, hsgpa, and credithrs are exogenous.
Endogenous variables are determined by the system of equations. At least one path points to each of them. The variables scholarships and fygpa are endogenous.
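The classification rule above can be sketched in a few lines of Python. This is only an illustration, not how Stata does it: the function name classify_variables and the specific paths listed are assumptions made for the example, since the path diagram itself is not reproduced in this transcript.

```python
def classify_variables(paths):
    """Split variables into exogenous and endogenous sets.

    paths: iterable of (source, target) pairs, one per straight arrow.
    A variable is endogenous if at least one path points to it,
    and exogenous otherwise.
    """
    sources = {src for src, _ in paths}
    targets = {dst for _, dst in paths}
    endogenous = targets
    exogenous = sources - targets
    return exogenous, endogenous


# Hypothetical paths; the real diagram may connect these variables differently.
example_paths = [
    ("satv", "scholarships"),
    ("satq", "scholarships"),
    ("hsgpa", "scholarships"),
    ("hsgpa", "fygpa"),
    ("credithrs", "fygpa"),
    ("scholarships", "fygpa"),
]

exogenous, endogenous = classify_variables(example_paths)
print(sorted(exogenous))
print(sorted(endogenous))
```

Note that scholarships is both the source of one path and the target of others; because at least one path points to it, it is classified as endogenous.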

Observed Exogenous: a variable in a dataset that is treated as exogenous in the model.
Latent Exogenous: an unobserved variable that is treated as exogenous in the model.
Observed Endogenous: a variable in a dataset that is treated as endogenous in the model.
Latent Endogenous: an unobserved variable that is treated as endogenous in the model.

Recursive and Nonrecursive Systems
Recursive models do not have any feedback loops or correlated errors.
Nonrecursive models have feedback loops or correlated errors. These models have paths in both directions between one or more pairs of endogenous variables.
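One way to make the definition concrete is to treat the paths as a directed graph and look for cycles. The sketch below is an illustration under that assumption; the function name is_recursive and the variable names in the examples are hypothetical.

```python
def is_recursive(paths, correlated_errors=()):
    """Return True if the model is recursive.

    paths: iterable of (source, target) pairs among variables.
    correlated_errors: iterable of (var_a, var_b) pairs whose errors covary.
    Following the definition above, a model is recursive only if it has
    no feedback loops and no correlated errors.
    """
    if list(correlated_errors):
        return False

    # Build an adjacency list for the directed graph of paths.
    graph = {}
    for src, dst in paths:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    # Depth-first search with three states to detect a directed cycle.
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}

    def has_cycle(node):
        state[node] = IN_PROGRESS
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                return True
            if state[nxt] == UNVISITED and has_cycle(nxt):
                return True
        state[node] = DONE
        return False

    return not any(state[n] == UNVISITED and has_cycle(n) for n in graph)


chain = [("x", "y1"), ("y1", "y2")]
loop = [("x", "y1"), ("y1", "y2"), ("y2", "y1")]

print(is_recursive(chain))                        # no loop, no correlated errors
print(is_recursive(loop))                         # y1 and y2 affect each other
print(is_recursive(chain, [("y1", "y2")]))        # errors of y1 and y2 covary
```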

Notation
Observed endogenous: y
Observed exogenous: x
Latent endogenous: η
Latent exogenous: ξ
Error of observed endogenous: e.y
Error of latent endogenous: e.η
All endogenous: Y = (y, η)
All exogenous: X = (x, ξ)
All error: ζ = (e.y, e.η)

$$Y = BY + \Gamma X + \alpha + \zeta$$
We estimate:
The coefficients $B$ and $\Gamma$
The intercepts, $\alpha$
The means of the exogenous variables, $\kappa = E(X)$
The variances and covariances of the exogenous variables, $\Phi = \operatorname{Var}(X)$
The variances and covariances of the errors, $\Psi = \operatorname{Var}(\zeta)$
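Solving the structural equation for Y gives the reduced form and the model-implied moments, which are what estimation compares with the observed means and covariances. The derivation below is a standard sketch under two assumptions that are not stated on the slide: that I − B is invertible, and that the errors ζ have mean zero and are uncorrelated with X.

```latex
\begin{aligned}
Y &= BY + \Gamma X + \alpha + \zeta \\
(I - B)\,Y &= \Gamma X + \alpha + \zeta \\
Y &= (I - B)^{-1}\left(\Gamma X + \alpha + \zeta\right) \\[4pt]
E(Y) &= (I - B)^{-1}\left(\Gamma \kappa + \alpha\right) \\
\operatorname{Var}(Y) &= (I - B)^{-1}\left(\Gamma \Phi \Gamma' + \Psi\right)\left[(I - B)^{-1}\right]' \\
\operatorname{Cov}(Y, X) &= (I - B)^{-1}\,\Gamma \Phi
\end{aligned}
```

Every implied moment is a function of the estimated quantities B, Γ, α, κ, Φ, and Ψ only.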

Assumptions
Large Sample Size
Multivariate Normality
Correct Model Specification

Assumptions
Large Sample Size
– ML estimation relies on asymptotics, and large sample sizes are needed to obtain reliable parameter estimates.
– Different suggestions regarding appropriate sample size have been given by different authors.
– A common rule of thumb is to have a sample size of more than 200, although sometimes 100 is seen as adequate.
– Other authors propose sample sizes relative to the number of parameters being estimated. Ratios of observations to free parameters from 5:1 up to 20:1 have been proposed.
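The ratio-based rules of thumb are simple arithmetic, sketched below. The function names and the study size are hypothetical, chosen only to show how the 5:1 and 20:1 proposals can give different answers for the same design.

```python
def observations_per_parameter(n_obs, n_free_params):
    """Ratio of sample size to number of free parameters."""
    if n_free_params <= 0:
        raise ValueError("n_free_params must be positive")
    return n_obs / n_free_params


def meets_ratio(n_obs, n_free_params, min_ratio):
    """True if the ratio is at least min_ratio (for example 5 for 5:1)."""
    return observations_per_parameter(n_obs, n_free_params) >= min_ratio


# Hypothetical study: 200 observations, 20 free parameters.
ratio = observations_per_parameter(200, 20)
print(ratio)                       # 10.0
print(meets_ratio(200, 20, 5))     # True
print(meets_ratio(200, 20, 20))    # False
```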

Assumptions
Multivariate Normality
– The likelihood that is maximized when fitting structural equation models using ML is derived under the assumption that the observed variables follow a multivariate normal distribution.
– The assumption of multivariate normality can often be relaxed, particularly for exogenous variables.

Assumptions
Correct Model Specification
– SEM assumes that no relevant variables are omitted from any equation in the model.
– Omitted variable bias can arise in linear regression if an independent variable is omitted from the model and the omitted variable is correlated with other independent variables.
– When structural equation models are fit with ML and all equations are fit jointly, errors can occur in equations other than the one with the omitted variable.

Assessing Model Goodness of Fit
Model Definitions
– The Saturated Model assumes that all variables are correlated.
– The Baseline Model assumes that no variables are correlated (except for observed exogenous variables when endogenous variables are present).
– The Specified Model is the model that we fit.
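To see how the saturated and baseline models differ in size, the sketch below counts their free parameters in a simple case. It assumes p observed variables, means included in the model, and no observed exogenous variables (so the exception mentioned above does not apply); the function names are hypothetical.

```python
def saturated_param_count(p):
    """Means plus all variances and covariances of p observed variables."""
    return p + p * (p + 1) // 2


def baseline_param_count(p):
    """Means plus variances only; all covariances are fixed at zero."""
    return p + p


p = 6
n_saturated = saturated_param_count(p)
n_baseline = baseline_param_count(p)

# The difference equals the number of covariances the baseline model sets to zero.
print(n_saturated, n_baseline, n_saturated - n_baseline)   # 27 12 15
```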

Likelihood Ratio $\chi^2$ (baseline vs saturated models)
$$\chi^2_{bs} = 2\left(\log L_s - \log L_b\right)$$
Likelihood Ratio $\chi^2$ (specified vs saturated models)
$$\chi^2_{ms} = 2\left(\log L_s - \log L_m\right)$$
where:
$\log L_b$ is the log likelihood for the baseline model
$\log L_s$ is the log likelihood for the saturated model
$\log L_m$ is the log likelihood for the specified model
$df_{bs} = df_s - df_b$
$df_{ms} = df_s - df_m$
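The two likelihood ratio statistics can be computed directly from the three log likelihoods. The sketch below uses made-up log likelihoods and parameter counts purely for illustration; none of these numbers come from the presentation.

```python
def lr_chi2(loglik_saturated, loglik_other):
    """Likelihood ratio chi-squared of a model against the saturated model."""
    return 2.0 * (loglik_saturated - loglik_other)


# Hypothetical log likelihoods and parameter counts.
loglik_s, df_s = -1000.0, 27   # saturated model
loglik_b, df_b = -1200.0, 12   # baseline model
loglik_m, df_m = -1010.0, 19   # specified model

chi2_bs = lr_chi2(loglik_s, loglik_b)
chi2_ms = lr_chi2(loglik_s, loglik_m)
df_bs = df_s - df_b
df_ms = df_s - df_m

print(chi2_bs, df_bs)   # 400.0 15
print(chi2_ms, df_ms)   # 20.0 8
```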

Assessing Model Goodness of Fit
Likelihood Ratio Chi-squared Test ($\chi^2_{ms}$)
Akaike's Information Criterion (AIC)
Schwarz's Bayesian Information Criterion (BIC)
Coefficient of Determination ($R^2$)
Root Mean Square Error of Approximation (RMSEA)
Comparative Fit Index (CFI)
Tucker-Lewis Index (TLI)
Standardized Root Mean Square Residual (SRMR)
Satorra-Bentler adjustment
See also: http://davidakenny.net/cm/fit.htm

Assessing Model Goodness of Fit
Likelihood Ratio $\chi^2$ (specified vs saturated models)
$$\chi^2_{ms} = 2\left(\log L_s - \log L_m\right)$$
where:
$\log L_s$ is the log likelihood for the saturated model
$\log L_m$ is the log likelihood for the specified model
$df_{ms} = df_s - df_m$
Good fit indicated by:
p-value > 0.05

Assessing Model Goodness of Fit
Akaike's Information Criterion (AIC)
$$AIC = -2 \log L_m + 2\,df_m$$
Schwarz's Bayesian Information Criterion (BIC)
$$BIC = -2 \log L_m + \ln(N)\,df_m$$
Good fit indicated by:
Used for comparing two models
Smaller is better
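As a small worked example of the two formulas, the sketch below computes AIC and BIC for a single model. The log likelihood, parameter count, and sample size are hypothetical, and the function names are illustrative only.

```python
import math


def aic(loglik_m, df_m):
    """Akaike's information criterion: -2 log L_m + 2 df_m."""
    return -2.0 * loglik_m + 2.0 * df_m


def bic(loglik_m, df_m, n_obs):
    """Schwarz's Bayesian information criterion: -2 log L_m + ln(N) df_m."""
    return -2.0 * loglik_m + math.log(n_obs) * df_m


# Hypothetical model: log likelihood -1010, 19 parameters, N = 500.
aic_value = aic(-1010.0, 19)
bic_value = bic(-1010.0, 19, 500)

print(aic_value)              # 2058.0
print(round(bic_value, 2))    # 2138.08
```

Because ln(N) exceeds 2 once N is 8 or more, BIC penalizes each additional parameter more heavily than AIC in any realistic sample.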

Assessing Model Goodness of Fit
Coefficient of Determination ($R^2$)
$$R^2 = 1 - \frac{\det(\Psi)}{\det(\Sigma)}$$
Good fit indicated by:
Values closer to 1 indicate good fit

Assessing Model Goodness of Fit
Root Mean Square Error of Approximation (RMSEA)
Compares the current model with the saturated model
The null hypothesis is that the model fits
$$RMSEA = \sqrt{\frac{\chi^2_{ms} - df_{ms}}{(N - 1)\,df_{ms}}}$$
Good fit indicated by:
Hu and Bentler (1999): RMSEA ≤ 0.06
Browne and Cudeck (1993):
– Good Fit (RMSEA < 0.05)
– Adequate Fit (RMSEA between 0.05 and 0.08)
– Poor Fit (RMSEA > 0.1)
P-value > 0.05
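A minimal sketch of the RMSEA formula follows, using hypothetical inputs. One detail is an assumption added here and not shown on the slide: when the chi-squared statistic is smaller than its degrees of freedom, the quantity under the square root is negative, so the sketch sets RMSEA to 0 in that case, which is a common convention.

```python
import math


def rmsea(chi2_ms, df_ms, n_obs):
    """Root mean square error of approximation.

    Uses (chi2_ms - df_ms) / ((N - 1) * df_ms) under the square root.
    Truncating negative values at zero is an assumed convention.
    """
    value = (chi2_ms - df_ms) / ((n_obs - 1) * df_ms)
    return math.sqrt(max(0.0, value))


# Hypothetical fit: chi2_ms = 20 on 8 degrees of freedom, N = 500.
rmsea_value = rmsea(20.0, 8, 500)
print(round(rmsea_value, 4))    # 0.0548

# A chi-squared below its degrees of freedom gives 0 under the assumed truncation.
print(rmsea(5.0, 8, 500))       # 0.0
```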

Assessing Model Goodness of Fit
Comparative Fit Index (CFI)
Compares the current model with the baseline model
$$CFI = 1 - \frac{\chi^2_{ms} - df_{ms}}{\chi^2_{bs} - df_{bs}}$$
Good fit indicated by:
CFI ≥ 0.95 (sometimes 0.90)
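The CFI formula can be checked with a short calculation. The inputs below are hypothetical, and the function implements the formula exactly as written above, without the bounding to the range 0 to 1 that some software applies.

```python
def cfi(chi2_ms, df_ms, chi2_bs, df_bs):
    """Comparative fit index: 1 - (chi2_ms - df_ms) / (chi2_bs - df_bs)."""
    return 1.0 - (chi2_ms - df_ms) / (chi2_bs - df_bs)


# Hypothetical fit: specified model chi2 = 20 (df 8), baseline chi2 = 400 (df 15).
cfi_value = cfi(20.0, 8, 400.0, 15)
print(round(cfi_value, 4))    # 0.9688
```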

Assessing Model Goodness of Fit
Tucker-Lewis Index (TLI)
Compares the current model with the baseline model
$$TLI = \frac{\chi^2_{bs}/df_{bs} - \chi^2_{ms}/df_{ms}}{\chi^2_{bs}/df_{bs} - 1}$$
Good fit indicated by:
TLI ≥ 0.95
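The TLI works with chi-squared to degrees-of-freedom ratios, which the sketch below makes explicit. The inputs are hypothetical and chosen to show that a model can fall on different sides of the 0.95 cutoff for TLI and CFI.

```python
def tli(chi2_ms, df_ms, chi2_bs, df_bs):
    """Tucker-Lewis index based on chi-squared / df ratios."""
    ratio_baseline = chi2_bs / df_bs
    ratio_model = chi2_ms / df_ms
    return (ratio_baseline - ratio_model) / (ratio_baseline - 1.0)


# Hypothetical fit: specified model chi2 = 20 (df 8), baseline chi2 = 400 (df 15).
tli_value = tli(20.0, 8, 400.0, 15)
print(round(tli_value, 4))    # 0.9416
```

With these inputs the same model has a CFI of about 0.969, so the two indexes need not agree at a given cutoff.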

Assessing Model Goodness of Fit
Standardized Root Mean Square Residual (SRMR)
SRMR is a measure of the average difference between the observed and model-implied correlations. This will be close to 0 when the model fits well. Hu and Bentler (1999) suggest values close to .08 or below.
Good fit indicated by:
SRMR ≤ 0.08
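The idea of an average difference between observed and implied correlations can be sketched as below. This uses one common definition, the square root of the mean squared residual over the unique elements of the correlation matrix (diagonal included); the exact formula Stata reports may differ, for example by also including mean residuals. The matrices are made up for illustration.

```python
import math


def srmr(observed_corr, implied_corr):
    """Root mean squared residual over the unique correlation elements.

    Both arguments are square, symmetric lists of lists of equal size.
    This is an assumed, simplified definition for illustration.
    """
    k = len(observed_corr)
    total = 0.0
    for i in range(k):
        for j in range(i + 1):  # lower triangle, diagonal included
            residual = observed_corr[i][j] - implied_corr[i][j]
            total += residual ** 2
    n_unique = k * (k + 1) // 2
    return math.sqrt(total / n_unique)


# Hypothetical observed and model-implied correlation matrices.
observed = [
    [1.00, 0.50, 0.40],
    [0.50, 1.00, 0.30],
    [0.40, 0.30, 1.00],
]
implied = [
    [1.00, 0.45, 0.42],
    [0.45, 1.00, 0.35],
    [0.42, 0.35, 1.00],
]

srmr_value = srmr(observed, implied)
print(round(srmr_value, 4))    # 0.03
```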

The Process of SEM
Specify the model
Fit the model
Evaluate the model
Modify the model
Interpret and report the results
