Missing Data & How To Deal: An Overview Of Missing Data

Transcription

Missing Data & How to Deal: Anoverview of missing dataMelissa HumphriesPopulation Research Center

Goals Discuss ways to evaluate and understand missing dataDiscuss common missing data methodsKnow the advantages and disadvantages of commonmethodsReview useful commands in Stata for missing data

General Steps for Analysis with MissingData 1. Identify patterns/reasons for missing and recodecorrectly2. Understand distribution of missing data3. Decide on best method of analysis

Step One: Understand your data Attrition due to social/natural processes Skip pattern in survey Example: School graduation, dropout, deathExample: Certain questions only asked to respondents whoindicate they are marriedIntentional missing as part of data collection processRandom data collection issuesRespondent refusal/Non-response

Find information from survey(codebook, questionnaire) Identify skip patterns and/or sampling strategy from documentation

Recode for analysis: mvdecode command MvdecodeHow stata reads missing Tip . #‟sNmissing npresent

Recode for analysis: mvdecode command MvdecodeHow stata reads missing Tip . #‟sNmissing npresentNote: Stata readsmissing (.) as avalue greater thanany number.

Analyze missing data patterns: misstable command

Step Two: Missing data Mechanism (orprobability distribution of missingness) Consider the probability of missingness Are certain groups more likely to have missing values? Are certain responses more likely to be missing? Example: Respondents in service occupations less likely to reportincomeExample: Respondents with high income less likely to report incomeCertain analysis methods assume a certain probabilitydistribution

Missing Data Mechanisms Missing Completely at Random (MCAR) Missing value (y) neither depends on x nor y Example: some survey questions asked of a simple randomsample of original sampleMissing at Random (MAR) Missing value (y) depends on x, but not y Example: Respondents in service occupations less likely to reportincomeMissing not at Random (NMAR) The probability of a missing value depends on the variable thatis missing Example: Respondents with high income less likely to report income

Exploring missing data mechanisms Can‟t be 100% sure about probability of missing (since wedon‟t actually know the missing values)Could test for MCAR (t-tests)—but not totally accurateMany missing data methods assume MCAR or MAR butour data often are MNAR Some methods specifically for MNAR Selection model (Heckman)Pattern mixture models

Good News!! Some MAR analysis methods using MNAR data are stillpretty good. May be another measured variable that indirectly can predictthe probability of missingness Example: those with higher incomes are less likely to report incomeBUT we have a variable for years of education and/or number ofinvestmentsML and MI are often unbiased with NMAR data even thoughassume data is MAR See Schafer & Graham 2002

Step 3: Deal with missing data Use what you know about Why data is missingDistribution of missing dataDecide on the best analysis strategy to yield the leastbiased estimates Deletion Methods Single Imputation Methods Listwise deletion, pairwise deletionMean/mode substitution, dummy variable method, single regressionModel-Based Methods Maximum Likelihood, Multiple imputation

Deletion Methods Listwise deletion AKA complete case analysisPairwise deletion

Listwise Deletion (Complete Case Analysis) Only analyze cases withavailable data on eachvariable Advantages: SimplicityComparability acrossanalysesDisadvantages: Reduces statistical power(because lowers n)Doesn‟t use all informationEstimates may be biased ifdata not MCAR*Gender8th grade mathtest score12th grademath 75F85.*NOTE: List-wise deletion often produces unbiased regression slope estimates as long asmissingness is not a function of outcome variable.

Application in Stata Any analysis including multiple variables automaticallyapplies listwise deletion.

Pairwise deletion (Available Case Analysis) Analysis with all cases inwhich the variables ofinterest are present. Advantage: Keeps as many cases aspossible for each analysisUses all information possiblewith each analysisDisadvantage: Can‟t compare analysesbecause sample differenteach time

Single imputation methods Mean/Mode substitutionDummy variable controlConditional mean substitution

Mean/Mode Substitution Replace missing value with sample mean or modeRun analyses as if all complete casesAdvantages: Can use complete case analysis methodsDisadvantages: Reduces variabilityWeakens covariance and correlation estimates in the data (becauseignores relationship between variables)

80604020203040508th grade math test scoreimputed 12th grade math test score (mean sub)6070

Dummy variable adjustment Create an indicator for missing value (1 value is missingfor observation; 0 value is observed for observation)Impute missing values to a constant (such as the mean)Include missing indicator in regressionAdvantage: Disadvantage: Uses all available information about missing observationResults in biased estimatesNot theoretically drivenNOTE: Results not biased if value is missing because of alegitimate skip

Regression Imputation Replaces missing values with predicted score from aregression equation. Advantage: Uses information from observed dataDisadvantages: Overestimates model fit and correlation estimatesWeakens variance

80604020203040508th grade math test score60imputed 12th grade math test score (single regression)70

Model-based methods Maximum LikelihoodMultiple imputation

Model-based Methods: Maximum LikelihoodEstimation Identifies the set of parameter values that produces thehighest log-likelihood. ML estimate: value that is most likely to have resulted in theobserved dataConceptually, process the same with or without missingdata Advantages: Uses full information (both complete cases and incomplete cases) tocalculate log likelihoodUnbiased parameter estimates with MCAR/MAR dataDisadvantages SEs biased downward—can be adjusted by using observedinformation matrix

Multiple Imputation 1. Impute: Data is „filled in‟ with imputed values usingspecified regression model This step is repeated m times, resulting in a separate dataseteach time.2. Analyze: Analyses performed within each dataset3. Pool: Results pooled into one estimate Advantages: Variability more accurate with multiple imputations for each missingvalue Considers variability due to sampling AND variability due to imputationDisadvantages: Cumbersome codingRoom for error when specifying models

Multiple Imputation Process1. Impute2. Analyze3. PoolFinal EstimatesDataset with Missing ValuesImputed DatasetsAnalysis results ofeach dataset

Multiple Imputation: Stata & SAS SAS: Proc miStata: ice (imputation using chained equations) & mim (analysis withmultiply imputed dataset)mi commands mi setmi registermi imputemi estimateNOTE: the ice command is the only chained equation methoduntil Stata12. Chained equations can be used as an option ofmi impute since Stata12.

ice & mim ice: Imputation using chained equations Series of equations predicting one variable at a timeCreates as many datasets as desiredmim: prefix used before analysis that performs analysesacross datasets and pools estimates

ice command1. Impute2. Analyze3. PoolFinal EstimatesDataset with Missing ValuesImputed DatasetsAnalysis results ofeach dataset

mim command1. Impute2. Analyze3. PoolFinal EstimatesDataset with Missing ValuesImputed DatasetsAnalysis results ofeach dataset

ice female lm latino black asian other F1PARED AGE1 intact bymirt ESL2 ALG2OH acgpa ac engallhardwtr Lksch MAE10 RAE10 hilep midw south public catholic colltype aceng ESL Lksch ESL, ///saving(imputed2) m(5) cmd (Lksch:ologit)Variable Command Prediction equation------------ --------- ----female [No missing data in estimation sample]lm [No missing data in estimation sample]latino [No missing data in estimation sample]black [No missing data in estimation sample]ALG2OH logit female lm latino black asian other F1PARED AGE1 intact bymirt ESL2 acgpa ac engall hardwtr Lksch MAE10 RAE10 hilep midw south public catholic colltype aceng ESL Lksch ESLacgpa regress female lm latino black asian other F1PARED AGE1 intact bymirt ESL2 ALG2OH ac engall hardwtr Lksch MAE10 RAE10 hilep midw south public catholic colltype aceng ESL Lksch ESLac engall regress female lm latino black asian other F1PARED AGE1 intact bymirt ESL2 ALG2OH acgpa hardwtr Lksch MAE10 RAE10 hilep midw south public catholic colltype Lksch ESLhardwtr logit female lm latino black asian other F1PARED AGE1 intact bymirt ESL2 ALG2OH acgpa ac engall Lksch MAE10 RAE10 hilep midw south public catholic colltype aceng ESL Lksch ESLLksch ologit female lm latino black asian other F1PARED AGE1 intact bymirt ESL2 ALG2OH acgpa ac engall hardwtr MAE10 RAE10 hilep midw south public catholic colltype aceng ESLMAE10 regress female lm latino black asian other F1PARED AGE1 intact bymirt ESL2 ALG2OH acgpa ac engall hardwtr Lksch RAE10 hilep midw south public catholic colltype aceng ESL Lksch ESLRAE10 regress female lm latino black asian other F1PARED AGE1 intact bymirt ESL2 ALG2OH acgpa ac engall hardwtr Lksch MAE10 hilep midw south public catholic colltype aceng ESL Lksch ESLhilep logit female lm latino black asian other F1PARED AGE1 intact bymirt ESL2 ALG2OH acgpa ac engall hardwtr Lksch MAE10 RAE10 midw south public catholic colltype aceng ESL Lksch ------------------------------Imputing .1.2.3.4.5file imputed2.dta saved

mim, storebv: svy: mlogit colltype ESL2 lm female latino black asian other F1PARED lowincAGE1 intact bymirt ALG2OH acgpa Lksch, b(0)Multiple-imputation estimates (svy: mlogit)Survey: Multinomial logistic regressionImputations Minimum obs Minimum dof --------------------------------------colltype Coef. Std. Err.tP t [95% Conf. Int.]FMI------------- -------------1 ESL2 -.375638.172034-2.180.029-.713366 -.0379090.000lm .448917.1321123.400.001.189561 .7082720.000female .243528.0739123.290.001.098427 .3886280.000latino .005748.125920.050.964-.241452 .2529480.000black .133203.1207741.100.270-.103895.37030.001asian .342303.1721571.990.047.004332 .6802730.000other -.432693.165435-2.620.009-.757467 -.107920.000F1PARED .170113.0330345.150.000.105262 .2349640.000lowinc -.256572.104118-2.460.014-.460972 -.0521710.000AGE1 -.40664.086344-4.710.000-.576146 -.2371340.000intact .177401.0695992.550.011.040768 .3140340.000bymirt .012549.0037253.370.001.005238 .0198610.001ALG2OH .483054.0816185.920.000.322823 .6432850.003acgpa .498286.0578798.610.000.384662 .6119110.001Lksch .068058.0599651.130.257-.049733 .1858480.043cons -2.31716.218751 -10.590.000-2.74661 -1.88770.009------------- -------------2 ESL2 -.845765.274438-3.080.002-1.38453 -.3070010.000lm .197629.1488861.330.185-.094659 .4899160.001female .150927.0853021.770.077-.016534 .3183890.000latino -.041933.150431-0.280.781-.337251 .2533860.000black 1.00681.134837.470.000.7421151.27150.002asian .521137.1739923.000.003.179564 .8627110.000other .09246.1893090.490.625-.279183 .4641030.000F1PARED .454096.0337813.440.000.387781 .5204110.001lowinc -.379085.119948-3.160.002-.614563 -.1436080.000AGE1 -.394956.089359-4.420.000-.570382 -.2195310.001intact .279601.0836773.340.001.11533 .4438730.000bymirt .048036.00455810.540.000.039089 .0569840.001ALG2OH 1.49035.11604212.840.0001.26254 1.718160.004acgpa 1.51343.0767219.730.0001.36282 1.664040.001Lksch .236965.0701353.380.001.099176 .3747540.048cons -7.84432.292087 -26.860.000-8.41775 -7.270890.008

mi commands Included in Stata 11Includes univariate multiple imputation (impute only onevariable)Multivariate imputation probably more useful for our dataSpecific order: mi setmi registermi imputemi estimate

mi set and mi register commands1. Impute2. Analyze3. PoolFinal EstimatesDataset with Missing ValuesImputed DatasetsAnalysis results ofeach dataset

mi impute command1. Impute2. Analyze3. PoolFinal EstimatesDataset with Missing ValuesImputed DatasetsAnalysis results ofeach dataset

mi estimate command1. Impute2. Analyze3. PoolFinal EstimatesDataset with Missing ValuesImputed DatasetsAnalysis results ofeach dataset

*******set data to be multiply imputed (can set to 'wide' format also)mi set flong*******register variables as "imputed" (variables with missing data that you want imputed)or "regular"mi register imputed readtest8 worked mathtest8mi register regular sex race*******describing datami describe*******setting seed so results are replicableset seed 8945*******imputing using chained equations—using ols regression for predicting read and mathtest using mlogit to predict workedmi impute chained (regress) readtest8 mathtest8 (mlogit) worked sex i.race, add(10)********check new imputed datasetmi describe*******estimating model using imputed valuesmi estimate:regress mathtest12 mathtest8 sex race

Dataset after imputation

Notes and help with mi in stata LOTS of options Can specify exactly how you want imputedCan specify the model appropriately (ex. Using svy command)mi impute mvn (multivariate normal regression) also usefulHelp mi is usefulAlso, UCLA has great website about ice and mi

General Tips Try a few methods: often if result in similar estimates, canput as a footnote to support methodSome don‟t impute dependent variable But would still use to impute independent variables

References Allison, Paul D. 2001. Missing Data. Sage University PapersSeries on Quantitative Applications in the Social Sciences.Thousand Oaks: Sage.Enders, Craig. 2010. Applied Missing Data Analysis.Guilford Press: New York.Little, Roderick J., Donald Rubin. 2002. Statistical Analysiswith Missing Data. John Wiley & Sons, Inc: Hoboken.Schafer, Joseph L., John W. Graham. 2002. “Missing Data:Our View of the State of the Art.” Psychological Methods.

Create an indicator for missing value (1 value is missing for observation; 0 value is observed for observation) Impute missing values to a constant (such as the mean) Include missing indicator in regression Advantage: Uses all available information about miss