Longitudinal Data Analysis Using Stata - Statistical Horizons


Longitudinal Data Analysis Using Stata
Paul D. Allison, Ph.D.
Upcoming Seminar: February 20-21, 2018, Stockholm, Sweden

Copyright 2017 by Paul D. Allison

Outline

1. Opportunities and challenges of panel data
   a. Data requirements
   b. Control for unobservables
   c. Determining causal order
   d. Problem of dependence
   e. Software considerations
2. Linear models
   a. Robust standard errors
   b. Generalized estimating equations
   c. Random effects models
   d. Fixed effects models
   e. Between-within models
3. Logistic regression models
   a. Robust standard errors
   b. GEE
   c. Subject-specific vs. population-averaged methods
   d. Random effects models
   e. Fixed effects models
   f. Between-within models
4. Count data models
   a. Poisson models
   b. Negative binomial models
5. Linear structural equation models
   a. Fixed and random effects in the SEM context
   b. Models for reciprocal causation with lagged effects

Panel Data

Data in which variables are measured at multiple points in time for the same individuals.

Response variable $y_{it}$ with $t = 1, 2, \ldots, T$.

Vector of predictor variables $x_{it}$. Some of these may vary with time, others may not.

Assume, for now, that time points are the same for everyone in the sample. (For some methods that assumption is not essential.)

Why Are Panel Data Desirable?

In Econometric Analysis of Panel Data (2008), Baltagi lists six potential benefits of panel data:

1. Ability to control for individual heterogeneity.
2. More informative data: more variability, less collinearity, more degrees of freedom and more efficiency.
3. Better ability to study the dynamics of adjustment. For example, a cross-sectional survey can tell you what proportion of people are unemployed, but a panel study can tell you the distribution of spells of unemployment.
4. Ability to identify and measure effects that are not detectable in pure cross-sections or pure time series. For example, if you want to know whether union membership increases or decreases wages, you can best answer this by observing what happens when workers move from union to non-union jobs, and vice versa.
5. Ability to construct and test more complicated behavioral models than with purely cross-section or time-series data. For example, distributed lag models may require fewer restrictions with panel data than with pure time-series data.
6. Avoidance of aggregation bias. This is a consequence of the fact that most panel data are micro-level data.

My List

1. Ability to control for unobservables.
   Accomplished by fixed effects methods.
2. Ability to resolve causal ordering: Does y cause x or does x cause y?
   Accomplished by simultaneous estimation of models with lagged predictors.
   Methods for doing this are still relatively undeveloped and underutilized.
3. Ability to study the effect of a “treatment” on the trajectory of an outcome (or, equivalently, the change in a treatment effect over time).

Problems with Panel Data

1. Attrition and missing data.
2. Statistical dependence among multiple observations from the same individual.
   • Repeated observations on the same individual are likely to be positively correlated. Individuals tend to be persistently high or persistently low.
   • But conventional statistical methods assume that observations are independent.
   • Consequently, estimated standard errors tend to be too low, leading to test statistics that are too high and p-values that are too low.
   • Also, conventional parameter estimates may be statistically inefficient (true standard errors are higher than necessary).
   • Many different methods to correct for dependence:

     o Robust standard errors
     o Generalized estimating equations (GEE)
     o Random effects (mixed) models
     o Fixed-effects models
   • Many of these methods can also be used for clustered data that are not longitudinal, e.g., students within classrooms, people within neighborhoods.

Software

I’ll be using Stata 14, with a focus on the xt and me commands.

These commands require that the data be organized in the “long form” so that there is one record for each individual at each time point, with an ID number that is the same for all records for the same individual, and a variable that indicates which time point the record comes from.

All of the methods described here can also be implemented in SAS.
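As a rough illustration of the “long form” requirement, the sketch below types in a tiny wide data set and reshapes it. The variable names (id, y90, y92, y94) and the values are made up for illustration; they are not taken from any data set used in this course.

```stata
* Hypothetical wide data: one record per person, one y variable per year
clear
input id y90 y92 y94
1 2 3 1
2 0 0 4
end

* Long form: one record per person per year,
* with id identifying the person and year identifying the time point
reshape long y, i(id) j(year)

* Declare the panel structure for the xt commands
xtset id year
list, sepby(id)
```

After the reshape, the two wide records become six long records, three per person.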

Linear Models for Quantitative Response

Notation:

$y_{it}$ is the value of the response variable for individual i at time t.
$z_i$ is a column vector of variables that describe individuals but do not vary over time.
$x_{it}$ is a column vector of variables that vary both over individuals and over time.

Basic model:

$$y_{it} = \mu_t + \beta x_{it} + \gamma z_i + \varepsilon_{it}, \qquad i = 1, \ldots, n;\; t = 1, \ldots, T$$

where $\varepsilon$ is a random error term with mean 0 and constant variance, assumed to be uncorrelated with x and z. $\beta$ and $\gamma$ are row vectors of coefficients.

No lags, different intercepts at each time point, coefficients the same at all time points.

Consider OLS (ordinary least squares) estimation.
   • Coefficients will be unbiased but not efficient.
   • Estimated standard errors will be too low because $\mathrm{corr}(\varepsilon_{it}, \varepsilon_{it'}) \neq 0$.

Example:

581 children interviewed in 1990, 1992, and 1994 as part of the National Longitudinal Survey of Youth (NLSY).

Time-varying variables:

    ANTI       antisocial behavior, measured with a scale ranging from 0 to 6
    SELF       self-esteem, measured with a scale ranging from 6 to 24
    POV        poverty status of family, coded 1 for in poverty, otherwise 0

Time-invariant variables:

    BLACK      1 if child is black, otherwise 0
    HISPANIC   1 if child is Hispanic, otherwise 0
    CHILDAGE   child’s age in 1990
    MARRIED    1 if mother was currently married in 1990, otherwise 0
    GENDER     1 if female, 0 if male
    MOMAGE     mother’s age at birth of child
    MOMWORK    1 if mother was employed in 1990, otherwise 0

The original data set nlsy.dta has 581 records, one for each child, with different names for the variables at each time point, e.g., ANTI90, ANTI92 and ANTI94.

We can convert the data into a set of 1743 records, one for each child in each year, using the reshape command:

    use c:\data\nlsy.dta, clear
    gen id = _n
    reshape long anti self pov, i(id) j(year)

    save persyr3, replace

    Data                               wide   ->   long
    -------------------------------------------------------
    Number of obs.                      581   ->   1743
    Number of variables                  17   ->   12
    j variable (3 values)                     ->   year
    xij variables:
                       anti90 anti92 anti94   ->   anti
                       self90 self92 self94   ->   self
                          pov90 pov92 pov94   ->   pov
    -------------------------------------------------------

Note:
   • The time-invariant variables are repeated across the multiple records for each child.
   • The variable id has a unique ID number for each child.
   • The variable year has values of 90, 92 or 94.

Now we’ll do OLS regression, with no correction for dependence:

    reg anti self pov black hispanic childage married gender momage momwork i.year

          Source |       SS       df       MS           Number of obs =    1743
    -------------+------------------------------       F( 11,  1731) =       …
           Model |  380.85789     11  34.6234446       Prob > F      =       …
        Residual | 3952.25743   1731  2.28322208       R-squared     =       …
    -------------+------------------------------       Adj R-squared =       …
           Total | 4333.11532   1742  2.48743704       Root MSE      =       …

            anti |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            self |  -.0741425   .0109632    -6.76   0.000     -.095645   -.0526401
             pov |   .4354025   .0855275     5.09   0.000     .2676544    .6031505
           black |   .1678622   .0881839     1.90   0.057    -.0050959    .3408204
        hispanic |          …
        childage |    .087056   .0622121     1.40   0.162    -.0349628    .2090747
         married |  -.0888875    .087227    -1.02   0.308    -.2599689     .082194
          gender |  -.4950259   .0728886    -6.79   0.000     -.637985   -.3520668
          momage |          …
         momwork |   .2120961   .0800071     2.65   0.008     .0551754    .3690168
                 |
            year |
              92 |   .0521538   .0887138     0.59   0.557    -.1218437    .2261512
              94 |   .2255775   .0888639     2.54   0.011     .0512856    .3998694
                 |
           _cons |          …

Although the coefficients are unbiased, they are not “efficient.” An estimator is said to be efficient if it has minimal sampling variability, that is, if its true standard errors are as small as possible.

More important, the estimated standard errors and p-values are probably too low.

Solution 1: Robust standard errors

Also known as Huber-White standard errors, sandwich estimates, or empirical standard errors.

For OLS linear models, conventional standard errors are obtained by first calculating the estimated covariance matrix of the coefficient estimates:

$$s^2 (X'X)^{-1}$$

where X is a matrix of dimension $Tn \times K$ (K is the number of coefficients) and $s^2$ is the residual variance. Standard errors are obtained by taking the square roots of the main diagonal elements of this matrix.

The formula for the robust covariance estimator is

$$\hat{V} = (X'X)^{-1} \left[ \sum_i X_i' \hat{u}_i \hat{u}_i' X_i \right] (X'X)^{-1}$$

where $X_i$ is a $T \times K$ matrix of covariate values for individual i and

$$\hat{u}_i = y_i - X_i \hat{\beta}$$

is a $T \times 1$ vector of residuals for individual i. The robust standard errors are the square roots of the main diagonal elements of $\hat{V}$.

In Stata, this method can be implemented with most regression commands using the vce option:

    reg anti self pov black hispanic childage married momage gender momwork i.year, vce(cluster id)

    Linear regression                          Number of obs =    1743
                                               F( 11,   580) =    8.99
                                               Prob > F      =  0.0000
                                               R-squared     =  0.0879
                                               Root MSE      =   1.511

                            (Std. Err. adjusted for 581 clusters in id)

                 |               Robust
            anti |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            self |  -.0741425   .0133707    -5.55   0.000    -.1004034   -.0478816
             pov |   .4354025   .1093637     3.98   0.000     .2206054    .6501995
           black |   .1678622   .1309221     1.28   0.200    -.0892769    .4250014
        hispanic |          …
        childage |    .087056   .0939055     0.93   0.354    -.0973804    .2714923
         married |  -.0888875   .1336839    -0.66   0.506    -.3514509     .173676
          momage |  -.0166933   .0241047    -0.69   0.489    -.0640364    .0306498
          gender |          …
         momwork |   .2120961   .1189761     1.78   0.075    -.0215803    .4457725
                 |
            year |
              92 |   .0521538   .0540096     0.97   0.335    -.0539244     .158232
              94 |   .2255775   .0641766     3.51   0.000     .0995306    .3516245
                 |
           _cons |   2.675312   1.138426     2.35   0.019     .4393717    4.911252

Although the coefficients are the same, almost all the standard errors are larger. This makes a crucial difference for MOMWORK, BLACK and HISPANIC.

Notes:
   • It’s possible for robust standard errors to be smaller than conventional standard errors.
   • You generally see a bigger increase in the standard errors for time-invariant variables than for time-varying variables.
   • Robust SEs are also robust to heteroskedasticity.
   • For small samples, robust standard errors may be inaccurate and have low power. To get reasonably accurate results, you need at least 20 clusters if they are approximately balanced, 50 if they are unbalanced.

Solution 2: Generalized Estimating Equations (GEE, population-averaged models)

For linear models, this is equivalent to feasible generalized least squares (GLS).

The attraction of this method is that it produces efficient estimates of the coefficients (i.e., true standard errors will be optimally small). It does this by taking the over-time correlations into account when producing the estimates.

Conventional least squares estimates are given by the matrix formula

$$(X'X)^{-1} X'y$$

GLS estimates are obtained by

$$(X'\hat{\Omega}^{-1}X)^{-1} X'\hat{\Omega}^{-1}y$$

where $\hat{\Omega}$ is an estimate of the covariance matrix for the error terms. For panel data, this will typically be a “block-diagonal” matrix. For example, if the sample consists of three people with two observations each, the covariance matrix will look like

$$\hat{\Omega} = \begin{bmatrix}
\hat{\sigma}_{11} & \hat{\sigma}_{12} & 0 & 0 & 0 & 0 \\
\hat{\sigma}_{12} & \hat{\sigma}_{22} & 0 & 0 & 0 & 0 \\
0 & 0 & \hat{\sigma}_{11} & \hat{\sigma}_{12} & 0 & 0 \\
0 & 0 & \hat{\sigma}_{12} & \hat{\sigma}_{22} & 0 & 0 \\
0 & 0 & 0 & 0 & \hat{\sigma}_{11} & \hat{\sigma}_{12} \\
0 & 0 & 0 & 0 & \hat{\sigma}_{12} & \hat{\sigma}_{22}
\end{bmatrix}$$

In Stata, the method can be implemented with the xtgee command. It’s convenient to first declare the data set to be a time-series cross-section data set using the xtset command.

    xtset id year

           panel variable:  id (strongly balanced)
            time variable:  year, 90 to 94, but with gaps
                    delta:  1 unit
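To make the block-diagonal structure of the error covariance matrix concrete, here is a minimal sketch that builds such a matrix for three people with two observations each. The 2 x 2 block S and its values are invented for illustration; in practice the block is estimated from the data.

```stata
* Hypothetical within-person covariance block (values made up):
* variances 2 and 3 on the diagonal, covariance 1 off the diagonal
matrix S = (2, 1 \ 1, 3)

* Kronecker product of a 3 x 3 identity matrix with S:
* S is repeated down the diagonal, with zeros everywhere else
matrix Omega = I(3) # S
matrix list Omega
```

The zeros outside the blocks express the assumption that observations from different people are uncorrelated.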

    xtgee anti self pov black hispanic childage married gender momage momwork i.year

    GEE population-averaged model              Number of obs      =    1743
    Group variable:                    id      Number of groups   =     581
    Correlation:             exchangeable      Obs per group: min =       3
                                                              avg =     3.0
                                                              max =       3
                                               Wald chi2(11)      =       …
    Scale parameter:             2.275542      Prob > chi2        =       …

            anti |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            self |  -.0620764   .0094874    -6.54   0.000    -.0806715   -.0434814
             pov |   .2471376    .080136     3.08   0.002      .090074    .4042013
           black |   .2267537   .1249995     1.81   0.070     -.018241    .4717483
        hispanic |          …
        childage |   .0884559   .0905831     0.98   0.329    -.0890836    .2659955
         married |  -.0495647   .1257172    -0.39   0.693     -.295966    .1968365
          gender |          …
          momage |          …
         momwork |   .2611318   .1140581     2.29   0.022      .037582    .4846815
                 |
            year |
              92 |   .0473396   .0585299     0.81   0.419    -.0673769     .162056
              94 |   .2163811   .0587023     3.69   0.000     .1013267    .3314355
                 |
           _cons |   2.531431   1.089759     2.32   0.020     .3955422    4.667321

By default, the standard errors are “model based”. Although corrected for dependence, they are sensitive to the particular correlation structure that is specified.

The default correlation structure is “exchangeable”, which means that the correlations between the dependent variable at different points in time are all the same. To see the estimated correlations, use the command:

    estat wcorr

    Estimated within-id correlation matrix R:

                 c1         c2         c3
    r1            1
    r2     .5636779          1
    r3     .5636779   .5636779          1

To get robust standard errors (that aren’t sensitive to the correlation structure), simply add the vce(robust) option to the xtgee command:

    xtgee anti self pov black hispanic childage married momage gender momwork i.year, vce(robust)

    GEE population-averaged model              Number of obs      =    1743
    Group variable:                    id      Number of groups   =     581
    Correlation:             exchangeable      Obs per group: min =       3
                                                              avg =     3.0
                                                              max =       3
                                               Wald chi2(11)      =   90.65
    Scale parameter:             2.275542      Prob > chi2        =  0.0000

                              (Std. Err. adjusted for clustering on id)

                 |               Robust
            anti |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            self |  -.0620764   .0101609    -6.11   0.000    -.0819915   -.0421614
             pov |   .2471376   .0835503     2.96   0.003     .0833821    .4108932
           black |   .2267537    .130129     1.74   0.081    -.0282945    .4818019
        hispanic |          …
        childage |   .0884559   .0939841     0.94   0.347    -.0957496    .2726615
         married |  -.0495647   .1341853    -0.37   0.712    -.3125631    .2134336
          momage |  -.0219197   .0239744    -0.91   0.361    -.0689087    .0250693
          gender |          …
         momwork |   .2611318   .1163266     2.24   0.025     .0331359    .4891276
                 |
            year |
              92 |   .0473396   .0535429     0.88   0.377    -.0576025    .1522817
              94 |   .2163811   .0634953     3.41   0.001     .0919327    .3408295
                 |
           _cons |          …
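The difference between model-based and robust standard errors can be explored on simulated data. The sketch below is only an illustration under stated assumptions: the data, the variable names (id, t, x, e, y) and all parameter values are invented, and the errors are deliberately generated as AR(1) within person so that the exchangeable working structure is wrong.

```stata
* Simulated panel (all names and parameter values are hypothetical)
clear
set seed 2017
set obs 300                          // 300 individuals
gen id = _n
expand 4                             // 4 time points per individual
bysort id: gen t = _n
gen x = rnormal()

* AR(1) errors within person, so the true correlations are NOT exchangeable
bysort id (t): gen e = rnormal() if _n == 1
by id: replace e = 0.7*e[_n-1] + sqrt(1 - 0.7^2)*rnormal() if _n > 1
gen y = 1 + 0.5*x + e

xtset id t

* Model-based SEs depend on the exchangeable structure being correct
xtgee y x, corr(exchangeable)
estimates store model_based

* Robust SEs do not depend on the working structure being correct
xtgee y x, corr(exchangeable) vce(robust)
estimates store robust_se

* Same coefficients in both columns; only the standard errors differ
estimates table model_based robust_se, se
```

Because the two fits use the same working correlation structure, the coefficient estimates are identical and any difference in the table comes from the standard errors alone.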

With only three time points, you’re probably better off specifying an “unstructured” model that imposes no pattern on the correlation matrix:

    xtgee anti self pov black hispanic childage married momage gender momwork i.year, vce(r) corr(uns)

    GEE population-averaged model              Number of obs      =    1743
    Group and time vars:          id year      Number of groups   =     581
    Correlation:             unstructured      Obs per group: min =       3
                                                              avg =     3.0
                                                              max =       3
                                               Wald chi2(11)      =   94.51
    Scale parameter:             2.273983      Prob > chi2        =  0.0000

                              (Std. Err. adjusted for clustering on id)

                 |               Robust
            anti |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            self |  -.0629882   .0101177    -6.23   0.000    -.0828186   -.0431579
             pov |    .268169   .0834573     3.21   0.001     .1045958    .4317423
           black |   .2129144   .1298973     1.64   0.101    -.0416796    .4675084
        hispanic |          …
        childage |   .0852542   .0934659     0.91   0.362    -.0979356    .2684441
         married |   -.050604   .1335751    -0.38   0.705    -.3124065    .2111984
          momage |  -.0202607     .02389    -0.85   0.396    -.0670842    .0265628
          gender |          …
         momwork |   .2525486   .1160187     2.18   0.029     .0251561     .479941
                 |
            year |
              92 |   .0477502   .0535456     0.89   0.373    -.0571972    .1526976
              94 |   .2171697   .0635099     3.42   0.001     .0926927    .3416468
                 |
           _cons |          …

    estat wcorr

    Estimated within-id correlation matrix R:

                 c1         c2         c3
    r1            1
    r2     .5512489          1
    r3     .5193459   .6186195          1

With many time points the number of unique correlations will get large: T(T-1)/2. And unless the sample is also large, estimates of all these parameters may be unreliable.

In that case, consider restricted models:

    TYPE    Description                   Formula
    AR#     Autoregressive of order #     $\varepsilon_{it} = \sum_{j=1}^{\#} \theta_j \varepsilon_{i,t-j} + \nu_{it}$
    STA#    Stationary of order #         $\rho_{ts} = \rho_{|t-s|}$ when $|t-s| \le \#$, otherwise $\rho_{ts} = 0$
    NON#    Non-stationary of order #     $\rho_{ts} = \rho_{ts}$ when $|t-s| \le \#$, otherwise $\rho_{ts} = 0$

Results will often be robust to choice of correlation structure, but sometimes it can make a big difference. An autoregressive structure of order 1 is usually too restrictive: the correlation goes down too rapidly with the time distance.

GEE can handle missing data on the response variable (or unbalanced panels) under the assumption that the data are missing completely at random, or that missingness depends only on the predictors. It does not allow missingness on y at one time to depend on observed values of y at other times.
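The growth in the number of correlations, and the way restricted structures are requested in xtgee, can be sketched as follows. This is an illustration on simulated data, not an analysis of the NLSY example: the variable names (id, t, u, x, y), the five waves and the parameter values are all assumptions made for the sketch.

```stata
* Number of distinct correlations in an unstructured matrix: T(T-1)/2
foreach T of numlist 3 5 10 20 {
    display "T = `T': " `T'*(`T'-1)/2 " correlations"
}

* Simulated balanced panel with 5 waves (hypothetical names and values)
clear
set seed 4321
set obs 200                          // 200 individuals
gen id = _n
gen u = rnormal()                    // person-specific component
expand 5                             // 5 time points per individual
bysort id: gen t = _n
gen x = rnormal()
gen y = 1 + 0.5*x + u + rnormal()
xtset id t

* Restricted working correlation structures, each with robust SEs
xtgee y x i.t, corr(ar 1) vce(robust)
xtgee y x i.t, corr(stationary 2) vce(robust)

* Unstructured alternative: 10 correlations with 5 waves
xtgee y x i.t, corr(unstructured) vce(robust)
estat wcorr
```

Comparing the coefficient for x across the three fits gives a quick check of how sensitive the results are to the choice of correlation structure.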
