Longitudinal Data Analysis Using R - Statistical Horizons


Longitudinal Data Analysis Using R
Stephen Vaisey, Ph.D.
Upcoming Seminar: August 1-2, 2019, Philadelphia, Pennsylvania

Longitudinal Data Analysis Using Stata

This handbook, which was prepared by Paul Allison in June 2018, closely parallels the slides for Stephen Vaisey's course on Longitudinal Data Analysis Using R. Stata data sets for the examples and exercises can be downloaded at StatisticalHorizons.com

Table of Contents

Why Are Panel Data Desirable?
Problems with Panel Data
Software
Linear Models for Quantitative Response Variables
  Example
  Solution 1: Robust Standard Errors
  Solution 2: Generalized Least Squares with Maximum Likelihood
  Data Requirements for Timing of Repeated Measurements
  Solution 3: Random Effects (Mixed) Models
  A Random Growth Curve Model
  Models for Three Level Data
  Combining Methods 2 and 3
  Solution 4: Fixed Effects Models
  Interactions in FE Models
  Fixed Effects Regression with Two Time Points
  Hybrid Method
Logistic Models for Categorical Response Variables
  Solution 1: Robust Standard Errors
  Solution 2: GEE
  Solution 3: Random Effects Models
  Solution 4: Fixed Effects Models
  Fixed Effects Logistic Regression for Two-Period Data
  Hybrid Method for Logistic Regression
  Ordered Logit Models for Response Variables with More than Two Categories
  Multinomial Logit Models for Response Variables with More than Two Categories
Models for Count Data
  Robust Standard Errors
  GEE Estimation
  Random Effects Models
  Fixed Effects Models
Linear Structural Equation Models for Fixed and Random Effects
  Random Effects
Assignments for Panel Data Course
  Assignment 1: Linear Models
  Assignment 2: Logistic Models
  Assignment 3: Count Data Models

Copyright 2018 by Paul D. Allison

Outline

1. Opportunities and challenges of panel data
   a. Data requirements
   b. Control for unobservables
   c. Determining causal order
   d. Problem of dependence
   e. Software considerations
2. Linear models
   a. Robust standard errors
   b. Generalized least squares with ML
   c. Random effects models
   d. Fixed effects models
   e. Between-within models
3. Logistic regression models
   a. Robust standard errors
   b. GEE
   c. Subject-specific vs. population-averaged methods
   d. Random effects models
   e. Fixed effects models
   f. Between-within models
4. Count data models
   a. Poisson models
   b. Negative binomial models
5. Linear structural equation models
   a. Fixed and random effects in the SEM context
   b. Models for reciprocal causation with lagged effects

Panel Data

Data in which variables are measured at multiple points in time for the same individuals.

Response variable yit, with t = 1, 2, ..., T

Vector of predictor variables xit. Some of these may vary with time, others may not.

Assume, for now, that time points are the same for everyone in the sample. (For some methods that assumption is not essential.)

Why Are Panel Data Desirable?

In Econometric Analysis of Panel Data (2008), Baltagi lists six potential benefits of panel data:

1. Ability to control for individual heterogeneity.
2. More informative data: more variability, less collinearity, more degrees of freedom and more efficiency.
3. Better ability to study the dynamics of adjustment. For example, a cross-sectional survey can tell you what proportion of people are unemployed, but a panel study can tell you the distribution of spells of unemployment.
4. Ability to identify and measure effects that are not detectable in pure cross-sections or pure time series. For example, if you want to know whether union membership increases or decreases wages, you can best answer this by observing what happens when workers move from union to non-union jobs, and vice versa.
5. Ability to construct and test more complicated behavioral models than with purely cross-section or time-series data. For example, distributed lag models may require fewer restrictions with panel data than with pure time-series data.
6. Avoidance of aggregation bias. A consequence of the fact that most panel data are micro-level data.

My List

1. Ability to control for unobservables. Accomplished by fixed effects methods.
2. Ability to investigate causal ordering: Does y cause x or does x cause y? Accomplished by simultaneous estimation of models with lagged predictors. Methods for combining fixed effects with cross-lagged models have only recently been developed and are not often used (outside of economics).
3. Ability to study the effect of a "treatment" on the trajectory of an outcome (or, equivalently, the change in a treatment effect over time).

Problems with Panel Data

1. Attrition and missing data.
2. Statistical dependence among multiple observations from the same individual.
   - Repeated observations on the same individual are likely to be positively correlated. Individuals tend to be persistently high or persistently low.
   - But conventional statistical methods assume that observations are independent.
   - Consequently, estimated standard errors tend to be too low, leading to test statistics that are too high and p-values that are too low.
   - Also, conventional parameter estimates may be statistically inefficient (true standard errors are higher than necessary).
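The source of this dependence is easy to see in a quick simulation (a Python illustration, not part of the handbook's Stata examples): give each individual a persistent component, and errors from the same individual at two occasions become positively correlated.

```python
import math
import random

random.seed(0)
n = 500

# Give each simulated individual a persistent component alpha_i; the two
# occasion-specific errors then share that component.
e1, e2 = [], []
for _ in range(n):
    alpha = random.gauss(0, 1)
    e1.append(alpha + random.gauss(0, 1))
    e2.append(alpha + random.gauss(0, 1))

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# With equal variances for alpha and the noise, the population correlation
# between e1 and e2 is 0.5; the sample value will be close to that.
r = corr(e1, e2)
print(round(r, 2))
```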

Many different methods to correct for dependence:
 o Robust standard errors
 o Generalized least squares
 o Generalized estimating equations (GEE)
 o Random effects (mixed) models
 o Fixed-effects models

Many of these methods can also be used for clustered data that are not longitudinal, e.g., students within classrooms, people within neighborhoods.

Software

I'll be using Stata 15, with a focus on the xt and me commands.

These commands require that the data be organized in the "long form" so that there is one record for each individual at each time point, with an ID number that is the same for all records for the same individual, and a variable that indicates which time point the record comes from. The "wide form" has one record per individual.

All of the methods described here can also be implemented in SAS.

Linear Models for Quantitative Response

Notation:

yit is the value of the response variable for individual i at time t.
zi is a column vector of variables that describe individuals but do not vary over time.
xit is a column vector of variables that vary both over individuals and over time.

Basic model:

   yit = μt + βxit + γzi + εit,   i = 1, ..., n;  t = 1, ..., T

where ε is a random error term with mean 0 and constant variance, assumed to be uncorrelated with x and z. β and γ are row vectors of coefficients.

No lags, different intercepts at each time point, coefficients the same at all time points.

Consider OLS (ordinary least squares) estimation.
 - Coefficients will be unbiased but not efficient. An efficient estimator is one whose true standard error is as small as possible, i.e., minimal variability across repeated samples.
 - Estimated standard errors will be too low because corr(εit, εit′) ≠ 0.

Example:

581 children interviewed in 1990, 1992, and 1994 as part of the National Longitudinal Survey of Youth (NLSY).

Time-varying variables:

 ANTI      antisocial behavior, measured with a scale ranging from 0 to 6.
 SELF      self-esteem, measured with a scale ranging from 6 to 24.
 POV       poverty status of family, coded 1 for in poverty, otherwise 0.

Time-invariant variables:

 BLACK     1 if child is black, otherwise 0
 HISPANIC  1 if child is Hispanic, otherwise 0
 CHILDAGE  child's age in 1990
 MARRIED   1 if mother was currently married in 1990, otherwise 0
 GENDER    1 if female, 0 if male
 MOMAGE    mother's age at birth of child
 MOMWORK   1 if mother was employed in 1990, otherwise 0

Original data set nlsy.dta has 581 records, one for each child, with different names for the variables at each time point, e.g., ANTI90, ANTI92 and ANTI94.

Before converting from the wide form to the long form, let's look at the over-time correlations for the dependent variable.

use c:\data\nlsy.dta, clear
corr anti*

(obs=581)

             |   anti90   anti92   anti94
-------------+---------------------------
      anti90 |   1.0000
      anti92 |   0.6380   1.0000
      anti94 |   0.5447   0.6008   1.0000

Note that the 4-year lag correlation is smaller than the two 2-year lag correlations.

Using the reshape command, we now convert the data into the long form, a set of 1743 records, one for each child in each year:

use c:\data\nlsy.dta, clear
gen id = _n
reshape long anti self pov, i(id) j(year)
save persyr3, replace

Data                               wide   ->   long
---------------------------------------------------------
Number of obs.                      581   ->   1743
Number of variables                  17   ->   12
j variable (3 values)                     ->   year
xij variables:
           anti90 anti92 anti94           ->   anti
           self90 self92 self94           ->   self
              pov90 pov92 pov94           ->   pov
---------------------------------------------------------

Note: The time-invariant variables are repeated across the multiple records for each child. The variable id has a unique ID number for each child.
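For intuition about what `reshape long anti self pov, i(id) j(year)` does, here is a minimal pure-Python sketch (the record shown is made up, not taken from nlsy.dta): each wide record becomes one long record per year, with the time-invariant variables repeated.

```python
# One wide record per child, with year-suffixed names (made-up values).
wide = [
    {"id": 1, "black": 1,
     "anti90": 1, "anti92": 2, "anti94": 2,
     "self90": 20, "self92": 21, "self94": 19,
     "pov90": 0, "pov92": 1, "pov94": 1},
]

def reshape_long(records, stubs, years):
    long_rows = []
    for rec in records:
        # Time-invariant variables are everything not matching a stub+year name.
        invariant = {k: v for k, v in rec.items()
                     if not any(k == s + str(y) for s in stubs for y in years)}
        for y in years:
            row = dict(invariant)   # repeated across the child's records
            row["year"] = y
            for s in stubs:
                row[s] = rec[s + str(y)]
            long_rows.append(row)
    return long_rows

long_form = reshape_long(wide, ["anti", "self", "pov"], [90, 92, 94])
print(len(long_form))   # one wide record becomes 3 long records
print(long_form[0])
```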

The variable year has values of 90, 92 or 94.

Now we'll do OLS regression, with no correction for dependence:

reg anti self pov black hispanic childage married gender momage momwork i.year

      Source |       SS           df       MS         Number of obs =    1743
-------------+------------------------------          F(11, 1731)   =       …
       Model |  380.857891        11  34.6234446      Prob > F      =       …
    Residual |  3952.25743      1731  2.28322208      R-squared     =       …
-------------+------------------------------          Adj R-squared =       …
       Total |  4333.11532      1742  2.48743704      Root MSE      =       …

------------------------------------------------------------------------------
        anti |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        self |  -.0741425   .0109632    -6.76   0.000     -.095645   -.0526401
         pov |   .4354025   .0855275     5.09   0.000     .2676544    .6031505
       black |   .1678622   .0881839     1.90   0.057    -.0050959    .3408204
    hispanic |          …          …        …       …            …           …
    childage |    .087056   .0622121     1.40   0.162    -.0349628    .2090747
     married |  -.0888875    .087227    -1.02   0.308    -.2599689     .082194
      gender |  -.4950259   .0728886    -6.79   0.000     -.637985   -.3520668
      momage |          …          …        …       …            …           …
     momwork |   .2120961   .0800071     2.65   0.008     .0551754    .3690168
        year |
          92 |   .0521538   .0887138     0.59   0.557    -.1218437    .2261512
          94 |   .2255775   .0888639     2.54   0.011     .0512856    .3998694
       _cons |          …          …        …       …            …           …
------------------------------------------------------------------------------

Although the coefficients are unbiased, they are not efficient (true standard errors are larger than necessary).

More important, the reported standard errors and p-values are probably too low.

Solution 1: Robust standard errors

Robust standard errors are standard error estimates that correct for dependence among the repeated observations. Also known as Huber-White standard errors, sandwich estimates, or empirical standard errors.

For OLS linear models, conventional standard errors are obtained by first calculating the estimated covariance matrix of the coefficient estimates:

   s²(X′X)⁻¹

where s² is the residual variance and X is a matrix of dimension Tn × K (n is the number of individuals, T is the number of time periods, and K is the number of coefficients). Standard errors are obtained by taking the square roots of the main diagonal elements of this matrix.

The formula for the robust covariance estimator is

   V̂ = (X′X)⁻¹ [ Σi Xi′(yi − Xiβ̂)(yi − Xiβ̂)′Xi ] (X′X)⁻¹

where Xi is a T x K matrix of covariate values for individual i and yi is a T x 1 vector of y values for individual i. The robust standard errors are the square roots of the main diagonal elements of V̂.

In Stata, this method can be implemented with most regression commands using the vce option:

reg anti self pov black hispanic childage married momage gender momwork i.year, vce(cluster id)

Linear regression                               Number of obs     =      1743
                                                F(11, 580)        =      8.99
                                                Prob > F          =    0.0000

                                                R-squared         =    0.0879
                                                Root MSE          =     1.511

                                 (Std. Err. adjusted for 581 clusters in id)
------------------------------------------------------------------------------
             |               Robust
        anti |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        self |  -.0741425   .0133707    -5.55   0.000    -.1004034   -.0478816
         pov |   .4354025   .1093637     3.98   0.000     .2206054    .6501995
       black |   .1678622   .1309221     1.28   0.200    -.0892769    .4250014
    hispanic |          …          …        …       …            …           …
    childage |    .087056   .0939055     0.93   0.354    -.0973804    .2714923
     married |  -.0888875   .1336839    -0.66   0.506    -.3514509     .173676
      momage |  -.0166933   .0241047    -0.69   0.489    -.0640364    .0306498
      gender |          …          …        …       …            …           …
     momwork |   .2120961   .1189761     1.78   0.075    -.0215803    .4457725
        year |
          92 |   .0521538   .0540096     0.97   0.335    -.0539244     .158232
          94 |   .2255775   .0641766     3.51   0.000     .0995306    .3516245
       _cons |   2.675312   1.138426     2.35   0.019     .4393717    4.911252
------------------------------------------------------------------------------

Although the coefficients are the same, almost all the standard errors are larger. This makes a crucial difference for MOMWORK, BLACK and HISPANIC.

Notes:
 - Robust standard errors may be smaller than conventional standard errors.
 - You generally see a bigger increase in the standard errors for time-invariant variables than for time-varying variables. Standard errors for time itself often decrease.
 - Robust SEs are also robust to heteroscedasticity and non-normality.
 - In small samples, robust standard errors may be inaccurate and have low power. For reasonably accurate results, you need at least 20 clusters if they are approximately balanced, 50 if they are unbalanced. See Cameron & Miller (2015) Journal of Human Resources.
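The sandwich formula can be computed directly from its pieces. Below is a minimal pure-Python sketch on a made-up three-cluster data set (the handbook itself uses Stata; the numbers and matrix helpers here are illustrative only, not the NLSY regression):

```python
import math

# Toy long-form data: three "individuals" (clusters), two time points each.
# First column of X is the intercept; all numbers are made up.
clusters = [
    ([[1, 0.0], [1, 1.0]], [1.0, 2.5]),
    ([[1, 2.0], [1, 3.0]], [2.0, 4.5]),
    ([[1, 4.0], [1, 5.0]], [5.0, 5.5]),
]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(A):  # inverse of a 2 x 2 matrix
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

X = [row for Xi, _ in clusters for row in Xi]
y = [[v] for _, yi in clusters for v in yi]

XtX_inv = inv2(matmul(transpose(X), X))
beta = matmul(XtX_inv, matmul(transpose(X), y))  # ordinary OLS coefficients

# "Meat" of the sandwich: sum over individuals of Xi'(yi - Xi b)(yi - Xi b)'Xi
meat = [[0.0, 0.0], [0.0, 0.0]]
for Xi, yi in clusters:
    ui = [[yv - matmul([xrow], beta)[0][0]] for xrow, yv in zip(Xi, yi)]
    term = matmul(matmul(transpose(Xi), ui), matmul(transpose(ui), Xi))
    meat = [[m + t for m, t in zip(mrow, trow)] for mrow, trow in zip(meat, term)]

V = matmul(matmul(XtX_inv, meat), XtX_inv)  # robust covariance matrix
robust_se = [math.sqrt(V[k][k]) for k in range(2)]
print(beta, robust_se)
```

The point estimates are plain OLS; only the covariance matrix (and hence the standard errors) changes. Stata's vce(cluster id) additionally applies small-sample degrees-of-freedom corrections omitted here.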

Solution 2: Generalized Least Squares (GLS) with Maximum Likelihood

The attraction of this method is that, in addition to getting the standard errors right, it produces efficient estimates of the coefficients (i.e., true standard errors will be optimally small). It does this by taking the over-time correlations into account when producing the coefficient estimates.

Conventional least squares estimates are given by the matrix formula

   (X′X)⁻¹X′y

GLS estimates are obtained by

   (X′Ω̂⁻¹X)⁻¹X′Ω̂⁻¹y

where Ω̂ is an estimate of the covariance matrix for the error terms. For panel data, this will typically be a "block-diagonal" matrix. For example, if the sample consists of three people with two observations each, the covariance matrix will look like

        | σ̂11  σ̂12   0    0    0    0  |
        | σ̂12  σ̂22   0    0    0    0  |
   Ω̂ =  |  0    0   σ̂11  σ̂12   0    0  |
        |  0    0   σ̂12  σ̂22   0    0  |
        |  0    0    0    0   σ̂11  σ̂12 |
        |  0    0    0    0   σ̂12  σ̂22 |

where σ̂11 is an estimate of var(εi1), σ̂22 is an estimate of var(εi2), and σ̂12 is an estimate of cov(εi1, εi2). We assume that these variances and covariances are the same across individuals.

There are many different ways to estimate these variances and covariances. I used to focus on the method of generalized estimating equations (GEE), as

implemented with the xtgee command. We will use this method later for logistic regression. For linear models, I now prefer maximum likelihood, implemented with the mixed command:

mixed anti self pov black hispanic childage married gender momage momwork i.year || id:, noconstant residuals(unstructured, t(year)) stddeviations

 - noconstant says "don't fit a random intercepts model" (see the next section).
 - residuals says "estimate variances and covariances for the error terms."
 - unstructured says "don't impose any structure on variances and covariances."
 - t(year) sets the time dimension.
 - stddeviations says "report standard deviations and correlations instead of variances and covariances."

Mixed-effects ML regression                     Number of obs     =      1,743
Group variable: id                              Number of groups  =        581
                                                Obs per group:
                                                              min =          3
                                                              avg =        3.0
                                                              max =          3
                                                Wald chi2(11)     =     105.93
Log likelihood = -2910.4053                     Prob > chi2       =     0.0000

------------------------------------------------------------------------------
        anti |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        self |  -.0597951   .0093479    -6.40   0.000    -.0781166   -.0414736
         pov |   .2739148   .0797349     3.44   0.001     .1176373    .4301924
       black |   .2221163   .1234505     1.80   0.072    -.0198422    .4640748
    hispanic |          …          …        …       …            …           …
    childage |   .0607793   .0895205     0.68   0.497    -.1146778    .2362363
     married |  -.0376591   .1244933    -0.30   0.762    -.2816615    .2063433
      gender |          …          …        …       …            …           …
      momage |          …          …        …       …            …           …
     momwork |   .2661671   .1127786     2.36   0.018      .045125    .4872091

        year |
          92 |   .0468588   .0532737     0.88   0.379    -.0575558    .1512734
          94 |   .2155011   .0628523     3.43   0.001     .0923129    .3386894
       _cons |          …          …        …       …            …           …
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
id:                  (empty) |
-----------------------------+------------------------------------------------
Residual: Unstructured       |
                     sd(e90) |   1.402862   .0413735      1.324071    1.486342
                     sd(e92) |    1.47728    .043512      1.394413    1.565072
                     sd(e94) |   1.635629   .0482859      1.543676    1.733059
               corr(e90,e92) |   .6045961   .0264302       .550231    .6538571
               corr(e90,e94) |   .5151491   .0308114      .4522374     .57296
               corr(e92,e94) |          …          …             …          …
------------------------------------------------------------------------------
LR test vs. linear model: chi2(5) = 552.57           Prob > chi2 = 0.0000

Note: The reported degrees of freedom assumes the null hypothesis is not on the boundary of the parameter space. If this is not true, then the reported test is conservative.

The coefficient for POV is noticeably smaller, although still highly significant. The coefficient for MOMWORK is somewhat larger, and the p-value is again well below .05.

The over-time correlations for the errors are all quite large. At the bottom we get a likelihood ratio chi-square test of the null hypothesis that (a) the three correlations are 0, and (b) the three standard deviations are the same. This is clearly rejected.

With five or fewer time points, the unstructured model is usually the best way to go. With many time points the number of unique correlations will get large: T(T-1)/2. And unless the sample is also large, estimates of all these parameters may be unstable.
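A small Python sketch of the two bookkeeping ideas at work here (the numbers are illustrative, not the fitted NLSY values): GLS stacks one per-individual covariance block down the diagonal of Ω̂, and an unstructured block requires T variances plus T(T-1)/2 unique covariances.

```python
# Build the block-diagonal GLS covariance matrix from one per-individual
# T x T block; the sigma values below are made up for illustration.
def block_diag(block, n):
    T = len(block)
    size = n * T
    omega = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for r in range(T):
            for c in range(T):
                omega[i * T + r][i * T + c] = block[r][c]
    return omega

sigma = [[1.0, 0.6], [0.6, 1.2]]  # var(e1), cov(e1,e2), var(e2)
omega = block_diag(sigma, 3)      # three individuals, two occasions each

def n_unstructured_params(T):
    # T variances (or standard deviations) plus T*(T-1)/2 unique covariances
    return T + T * (T - 1) // 2

print(len(omega), n_unstructured_params(3), n_unstructured_params(10))
```

For T = 3 (as in the NLSY data) the unstructured model needs only 6 parameters, but at T = 10 it needs 55, which is why restricted structures become attractive with many time points.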

In that case, consider restricted models. Let ρts be the correlation between εit and εis, i.e., errors for the same individual at time t and time s. Here are some possible structures for longitudinal data:

TYPE          Description                                Formula
EXCH          Exchangeable, i.e., equal correlations     ρts = ρ
AR#           Autoregressive of order #                  εit = Σ(j=1 to #) θj εi,t-j + νit
TOEPLITZ#     Correlation depends on distance            ρts = ρ|t-s| when |t-s| ≤ #,
              between t and s                            ρts = 0 otherwise
EXPONENTIAL   A generalization of AR1 to allow           ρts = ρ^|t-s|
              unequal gaps

Results will often be robust to choice of correlation structure, but sometimes it can make a big difference. Unfortunately, all four of these structures presume that the error variances are constant over time. No way to relax that.

The maximum order of AR and Toeplitz is one less than the number of time points. That's the default for Toeplitz (good) but the default for AR is 1 (bad). With AR1 the correlation usually goes down too rapidly with the distance between measurements.

Let's try the exchangeable structure with the NLSY data:

mixed anti self pov black hispanic childage married gender momage momwork i.year || id:, nocon res(exch) stddev

Notice that you don't need the t(year) option with this correlation structure.
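To make the structures concrete, here is a Python sketch of the error correlations implied by two of them at the NLSY measurement years (the rho values are made up for illustration):

```python
import math

# Implied error-correlation matrix for a given structure over given time points.
def implied_corr(structure, times, rho):
    T = len(times)
    R = [[1.0] * T for _ in range(T)]
    for s in range(T):
        for t in range(T):
            if s != t:
                if structure == "exchangeable":
                    R[s][t] = rho
                elif structure == "exponential":
                    # rho is the correlation per single unit of time
                    R[s][t] = rho ** abs(times[s] - times[t])
    return R

years = [90, 92, 94]
exch = implied_corr("exchangeable", years, 0.55)
expo = implied_corr("exponential", years, 0.78)

# Exchangeable forces all lags to share one correlation; exponential makes the
# 4-year-lag correlation the square of the 2-year-lag correlation, matching
# the qualitative pattern in the observed anti90/anti92/anti94 correlations.
print(exch[0][2], expo[0][1], expo[0][2])
```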
