Multiple Imputation For Skewed Multivariate Data: A Marriage Of The MI .

Transcription

Paper 3605-2019Multiple Imputation for Skewed Multivariate Data: A Marriageof the MI and COPULA ProceduresZhixin Lun, Ravindra Khattree, Oakland UniversityABSTRACTMissing data is a common phenomenon in various data analyses. Imputation is a flexiblemethod for handling missing-data problems since it efficiently uses all the availableinformation in the data. Apart from regression imputation approach, the MI procedure inSAS also provides the multiple imputation options which create multiple data sets based onMarkov chain Monte Carlo (MCMC) and fully conditional specification (FCS) methods.However, these methods may not work very effectively for skewed multivariate data sincethey require the assumption of multivariate normal distribution. To deal with such data, weintroduce an approach based on copula transformation. We combine imputation using PROCMI and copula theory using PROC COPULA to arrive at an approach to solve the missingdata problem for skewed multivariate data. We implement and demonstrate the use of thismethod through simulated examples under the assumption that data are missing completelyat random (MCAR).INTRODUCTIONMost of the methodology available for missing data imputation assumes data distributed asmultivariate normal (see Little and Rubin 2002, Rao et al. 2007). Applying normality-basedimputation in skewed data may cause practical issues for the simple reason of violation ofdistributional assumptions. One common way to deal with non-normal data is to applynormalizing transformation prior to the imputation phase and then back-transform tooriginal scale at the analysis phase. However, transformation of each variable individuallymay alter the association structure among variables and hence may impact the accuracy ofimputations.As Bahuguna and Khattree (2019) illustrated, based on copula transformation, multivariateskewed data can be transformed to any other multivariate distribution without losingdependence information among random variables. This property provides an approach tonormalize multivariate skewed data and more importantly, ensures that existing normalitybased imputation methods are applicable for the analysis of multivariate skewed data. Ourwork here builds on this crucial and important observation.The objective of this work is to illustrate the implementation of above ideas by applying thecopula transformation using PROC COPULA and to combine PROC MI for multiple imputationfor the missing data in case of skewed multivariate data. In the following section, we revisitthe basic concept of copula and the Sklar's theorem (Sklar, 1959), which is the foundationof copula transformation, and then we show the details of copula transformation algorithmand its implementation in SAS.COPULAS AND COPULA TRANSFORMATIONTHE COPULA TRANSFORMATIONIn copula theory, copula is a multivariate probability distribution where the marginalprobability distribution of each variable is uniform. In other words, a function ๐ถ is a ๐‘‘-1

dimensional copula if there is a random vector ๐‘ˆ (๐‘ˆ1 , ๐‘ˆ2 , , ๐‘ˆ๐‘‘ )โ€ฒ , such that for ๐‘– 1,2, , ๐‘‘,๐‘ˆ๐‘– Uniform (0,1), and๐ถ(๐‘ข1 , ๐‘ข2 , , ๐‘ข๐‘‘ ) ๐‘ƒ[๐‘ˆ1 ๐‘ข1 , ๐‘ˆ2 ๐‘ข2 , , ๐‘ˆ๐‘‘ ๐‘ข๐‘‘ ].The most important theorem in copula theory is the Sklar's theorem (Sklar, 1959), whichstates that a function F: Rd [0,1] is the distribution function of a random vector ๐‘‹ (๐‘‹1 , ๐‘‹2 , , ๐‘‹๐‘‘ )โ€ฒ if and only if there is a copula ๐ถ from [0,1]๐‘‘ to [0,1] and ๐‘‘ univariate distributionfunctions ๐น1 , ๐น2 , , ๐น๐‘‘ such that๐ถ(๐น1 (๐‘ฅ1 ), ๐น2 (๐‘ฅ2 ), , ๐น๐‘‘ (๐‘ฅ๐‘‘ )) ๐น(๐‘ฅ1 , ๐‘ฅ2 , , ๐‘ฅ๐‘‘ ).This theorem indirectly implies that two different continuous multivariate distributions canbe transformed to each other via the same copula. Specifically, consider two differentcontinuous multivariate cumulative distributions denoted by ๐น( ) and ๐บ( ) and assume thatthey have a common copula. Then the transformation is shown as follows from Sklar'stheorem,๐‘ญ(๐’™๐Ÿ , ๐’™๐Ÿ , , ๐’™๐’… ) ๐‘ช(๐‘ญ๐Ÿ (๐’™๐Ÿ ), ๐‘ญ๐Ÿ (๐’™๐Ÿ ), , ๐‘ญ๐’… (๐’™๐’… )) ๐‘ช(๐’–๐Ÿ , ๐’–๐Ÿ , , ๐’–๐’… ) ๐‘ช(๐‘ฎ๐Ÿ (๐’š๐Ÿ ), ๐‘ฎ๐Ÿ (๐’š๐Ÿ ), , ๐‘ฎ๐’… (๐’š๐’… )) ๐‘ฎ(๐’š๐Ÿ , ๐’š๐Ÿ , , ๐’š๐’… ),(๐Ÿ)where ๐น๐‘– ( ) and ๐บ๐‘– ( ) are the corresponding marginal cumulative distribution functions arisingout of ๐น( ) and ๐บ( ), respectively. Thus, a set of data on (๐‘ฅ1 , , ๐‘ฅ๐‘‘ ) can be transformed as(๐‘ฆ1 , , ๐‘ฆ๐‘‘ ) and vice versa via dependent uniform data (๐‘ข1 , ๐‘ข2 , , ๐‘ข๐‘‘ )โ€ฒ created in between.In this study, since our purpose is to normalize multivariate variables, we assume that thecommon copula is a Gaussian copula ฮฆฮผ,ฮฃ ( ), that is,๐ถฮฃ (๐‘ข1 , , ๐‘ข๐‘‘ ) ฮฆฮผ,ฮฃ (ฮฆ 1 (๐‘ข1 ), , ฮฆ 1 (๐‘ข๐‘‘ )),where ฮฆฮผ,ฮฃ ( ) is the cumulative distribution of multivariate normal distribution with meanvector ๐œ‡ and covariance matrix ฮฃ. ฮฆ( ) is the cumulative distribution function of the standardunivariate normal and ฮฆ 1 ( ) is its inverse function.THE ALGORITHMWe start with the missing data problem, for which missingness occurs in one variabledenoted by ๐˜ while the other variables ๐— i 's are fully observed. Then the missing datastructure can be divided into two blocks: (i) the complete cases denoted by (๐˜obs , ๐— cc ) and(ii) incomplete cases denoted by (๐˜mis , ๐— ic ). Let๐˜(๐˜, ๐—) [ ๐จ๐›๐ฌ๐˜๐ฆ๐ข๐ฌ๐— ๐œ๐œ].๐— ๐ข๐œAccording to the above process of copula transformation as stated in Equation (1), weimplement the following algorithm,1. Transform the complete cases (๐˜๐จ๐›๐ฌ , ๐— ๐œ๐œ ) to uniformly distributed data ๐”๐œ๐œ (๐‘ˆ๐‘Œ , ๐‘ˆ๐‘‹1 , ๐‘ˆ๐‘‹2 , , ๐‘ˆ๐‘‹๐‘˜ ) using the empirical cumulative distribution function estimated fromthe data.2. For the incomplete case, transform ๐— ๐ข๐œ to uniformly distributed data ๐”๐ข๐œ (๐‘ˆ๐‘‹1 , ๐‘ˆ๐‘‹2 , , ๐‘ˆ๐‘‹๐‘˜ )using the empirical cumulative distribution function estimated from the data. There is no2

๐‘ˆ๐‘Œ data due to missingness.3. Combine ๐”๐œ๐œ and ๐”๐ข๐œ into ๐”, that is๐” [๐”๐‘๐‘],๐”๐‘–๐‘and convert ๐” to a new dataset (๐˜ , ๐— ) using inverse multivariate normal cumulativedistribution, corresponding to the correlation matrix from the original data. Thus,(๐˜ , ๐— ) [ ๐˜๐จ๐›๐ฌ ๐˜๐ฆ๐ข๐ฌ๐— ๐œ๐œ ].๐— ๐ข๐œAt this stage, after transforming from ๐” to (๐˜ , ๐— ), one of the imputation methods can beapplied on multivariate normally distributed (๐˜ , ๐— ) as in Step 4 below.4. Use one of the imputation procedures (e.g. regression, MCMC, FCS) as desired to impute all missing values of ๐˜๐ฆ๐ข๐ฌ. Multivariate normality of (๐˜ , ๐— ) makes this step easilyimplementable using PROC MI.5. Back-transform the filled-in data to original scale via ๐” according to the chosen copulafunction.It is assumed that the missingness scheme is independent of any such transformation andhence will remain the same all through the transformation.IMPLEMENTATIONWe illustrate the implementation scheme step by step following the above algorithm on asample dataset misData with four variables (๐‘Œ, ๐‘‹1 , ๐‘‹2 , ๐‘‹3 ), which contains missing values in ๐‘Œand fully observed values in ๐‘‹1 , ๐‘‹2 , and ๐‘‹3 . We add an indicator column Flag into the datasetmisData such that Flag 'X' and '.' are for complete and incomplete cases, respectively.Step 1 & 2: Transform complete cases (๐˜๐‘œ๐‘๐‘  , ๐‘ฟ๐‘๐‘ ) and incomplete cases ๐— ๐‘–๐‘ to uniformrandom variables using PROC COPULA, respectively. We specify parameter normal in FITstatement since we use Gaussian copula. The setting marginals empirical indicates thatwe use the empirical cumulative distribution function estimated from the data. The resultingdataset unif cc star is the data set on the transformed uniform variables from completecases (๐˜๐‘œ๐‘๐‘  , ๐— ๐‘๐‘ ), while unif ic star is the data set on the transformed uniform variablesfrom incomplete cases ๐— ๐‘–๐‘ .%let misVar y;%let ccVarList x1 x2 x3;proc copula data misData(where (Flag 'X'));var &misVar &ccVarList;fit normal / marginals empirical outpseudo unif cc noprint;run;proc copula data misData;var &ccVarList;fit normal / marginals empirical outpseudo unif ic noprint;run;data unif cc star;set unif cc;Flag 'X';run;3

data unif ic star(where (Flag '.'));merge unif ic misData(keep Flag);run;Step 3: Combine two datasets unif cc star and unif ic star and transform each ofuniformly distributed column data to standard normal by using quantile function.data unif u;set unif cc star unif ic star;run;data std norm;set unif u;if Flag 'X' then y quantile("Normal", y);x1 quantile("Normal", x1);x2 quantile("Normal", x2);x3 quantile("Normal", x3);run;Step 4: Apply the desired multiple imputation method on the dataset std norm. MCMCmethod is selected as an example in the code given below.proc mi data std norm nimpute 5 out mi std norm seed 1234 noprint;mcmc;var &misVar &ccVarList;run;Step 5: Note that the imputed values in above dataset mi std norm are still in standardnormal scale. The last step is to back-transform the filled-in data to original scale accordingto the copula. This process involves two steps:(a) Simulate a large number (e.g., NSIM 10,000) of observations from multivariate uniformdistribution corresponding to our copula and convert those simulated observations to thedata on variables in original data scale and to the data on variables with standard normaldistribution, respectively. This can be readily simulated by using FIT and SIMULATEstatements in PROC COPULA. The FIT statement setting must be the same as we set in Step1 & 2 since we back-transform the data according to the same copula. The output datasetsim org contains the simulated observations in original scale and sim unif consists of thesimulated observations distributed as multivariate uniform distribution. The datasetsim std norm is the converted data where each variable is distributed as standard normal.%let NSIM 10000;proc copula data misdata(where (Flag 'X'));var &misVar &ccVarList;fit normal / marginals empirical noprint;simulate /ndraws &NSIM seed 1234567out sim org outuniform sim unif;run;data sim std norm;set sim unif;sy quantile("Normal", y);sx1 quantile("Normal", x1);4

sx2 quantile("Normal", x2);sx3 quantile("Normal", x3);keep sy sx1 sx2 sx3;run;(b) Obtain the imputed values in original data scale by interpolation from above simulatedobservations in data sets sim org and sim std norm. Denote the imputed value in Step 4by ๐‘ฆฬ‚,ฬ‚๐‘˜ is sandwiched between two values ๐‘ ๐‘ฆ๐‘ก and๐‘˜ which is in standard normal scale. If ๐‘ฆ๐‘ ๐‘ฆ๐‘ก 1 in dataset sim std norm, then we predict ๐‘ฆ๐‘˜ in its original scale by averaging, ingeneral, by interpolating values corresponding to ๐‘ ๐‘ฆ๐‘ก and ๐‘ ๐‘ฆ๐‘ก 1 in dataset sim org. In theresulting dataset impt org scale, the variable ry with MIS 'Y' are the imputed values inoriginal scale.data sim org;set sim org;keep y;rename y ry;run;data sim std norm(keep sy);set sim std norm;run;proc sort data sim std norm; by sy; run;proc sort data sim org; by ry; run;data sim org std;merge sim std norm sim org;run;/* filter the imputed values in variable y*/data impt std norm;set mi std norm(where (Flag '.'));keep y;rename y sy;run;data impt sim comb;set impt std norm sim org std;run;proc sort data impt sim comb; by sy; run;data impt org scale;merge impt sim comb impt sim comb(keep ry firstobs 2rename (ry lead mis));lag mis lag(ry);if ry . then do;ry mean(lag mis, lead mis);MIS 'Y';end;run;5

AN ILLUSTRATION VIA SIMULATED DATA SETSCOMPLETE DATA GENERATION AND MISSINGNESS MECHANISMUsing the Iman-Conover method given by Wicklin (2013), we generate data sets from thefollowing two multivariate distributions, where marginals of components are as specified inTable 1,Group๐‘‹1๐‘‹2๐‘‹3๐‘‹41Log-normal (0, ๐œŽ)Pareto (1,1)Normal (0,1)Uniform (0,1)2Log-normal (0, ๐œŽ)Normal (0,1)Exponential (1)Uniform (0,1)Table 1. Marginal distributions of simulated data setswhere ๐œŽ was set as 1.0, 2.0 and 3.0. In each case, the following correlation structure wasused,1ฯCorr [ฯฯฯ1ฯฯฯฯ1ฯฯฯ]ฯ1where ฯ was set as 0.9.Missing values are assumed to be missing completely at random (MCAR).EVALUATION OF OUR IMPUTATION METHODWe select ๐‘‹1 as the variate with missing values. The sample size is taken as 100 and thenumber of missing cases as 5. To evaluate the quality of imputation, we simulate eachscenario NSIM 1,000 times and use ๐‘˜ imputation(s). Then we compute the mean of thesum of squared residuals byNSIM๐‘˜521impt(๐‘š)๐‘ก๐‘Ÿ๐‘ข๐‘’MSSR (๐‘‹1๐‘– ๐‘‹1๐‘–) ,NSIM๐‘š 1 ๐‘– 1impt(๐‘š)๐‘ก๐‘Ÿ๐‘ข๐‘’where ๐‘‹1๐‘–is the ๐‘š-th imputed value for the ๐‘–-th missing value ๐‘‹1๐‘– and ๐‘‹1๐‘–is the trueobserved value of ๐‘‹1๐‘– .MULTIPLE IMPUTATION METHODSWe select FCS regression (Van Buuren 2007) and MCMC (Schafer 1997) multiple imputationmethods for multiple imputation. The general idea of FCS regression is to generate ๐‘˜ sets ofpredicted values based on regression model, which involves filled-in phase and imputationphrase. MCMC method is used to generate ๐‘˜ sets of predicted values according to theposterior distributions in Bayesian inference. Both methods require the assumption that thedata are from a multivariate normal distribution. In our illustration, we choose ๐‘˜ 5.SIMULATION RESULTTable 2 and Table 3 give sample summaries using FCS and MCMC methods for original andcopula-transformed data. The column Ratio (O/C) is the ratio of the MSSR values of aboveto respective data. The larger Ratio (O/C) indicates the better performance of copula6

transformation. The column %SSR (O C) is the percent of times our approach results insmaller sum of squared residuals. Accordingly, the larger percentage value indicatessuperior performance of our transformation approach. The following results show that ourapproach performs substantially better than the case when multivariate normality of thedata was blindly assumed. A more detailed extensive simulation work, not reported heredue to lack of space, confirms to the above observations.MethodMSSR๐ˆOriginal(Assumed medRatio(O/C)(O 9%3.067,349,495.8513,494,377.604.9985.5%Table 2 Comparison between original data and copula-transformed data usingmultiple imputation (๐’Œ ๐Ÿ“) for Group 1 with correlation choosing ๐† ๐ŸŽ. ๐Ÿ—MethodMSSR๐ˆOriginal(Assumed medRatio(O/C)(O 235,290.8429,549,312.571.5085.5%Table 3 Comparison between original data and copula-transformed data usingmultiple imputation (๐’Œ ๐Ÿ“) for Group 2 with correlation choosing ๐† ๐ŸŽ. ๐Ÿ—CONCLUSIONWe have introduced a very powerful approach based on copula transformation to impute themissing values for general skewed multivariate data. We provide the algorithm and itsimplementation for one-variate missing pattern. Algorithm can be readily modified for ๐‘˜variate (๐‘˜ 1) missing patterns. In view of ready accessibility of MI and COPULAprocedures, this approach has a very wide scope for practical applications. A complete7

program combining all the procedures pieces is given as the supplementary material. Anexecution of our program results in imputed values (column rx1 in bold) shown in Table 4for five missing values.Obs Imputationsx1sx2sx3sx4Flag Srrx1MIS11-0.71624 -0.88485 -0.62092 -0.39469.10.61905Y21-1.15206 -1.28721 -1.48097 -2.33008.20.42109Y31-0.41595 -0.71397 -0.11191 -0.47658.30.92597Y410.374300.561790.65130 Y0.16202Table 4 The first five output (the imputed values are in column rx1 in bold) ofexecution of sample code in supplementary materialREFERENCESBahuguna, M. and Khattree, R. 2019. โ€œA Generic All Purpose Transformation for MultivariateModeling through Copulas.โ€ To appear in International Journal of Data Science andAnalytics.Little, R.J.A. and Rubin, D.B. 2002. Statistical Analysis with Missing Data. 2nd Edition.Hoboken, NJ: John Wiley & Sons.Rao, C.R., Toutenburg, H., Shalabh, Heumann, C. 2007. Linear Models and Generalizations,Least Squares and Alternatives. 3rd Extended Edition. New York, NY: Springer.SAS Institute 2014. SAS/ETS 13.2 User's Guide The COPULA Procedure. Cary, NC: SASInstitute Inc.SAS Institute 2017. SAS/STAT 14.3 User's Guide The MI Procedure. Cary, NC: SAS InstituteInc.Sklar, A. 1959. "Distribution Functions of n Dimensions and Margins," Publications of theInstitute of Statistics of the University of Paris, 8: 229-231.Schafer, J.L. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.Van Buuren, S. 2007. โ€œMultiple Imputation of Discrete and Continuous Data by FullyConditional Specification.โ€ Statistical Methods in Medical Research, 16:219-242.Wicklin, R. 2013. Simulating Data with SAS. Cary, NC: SAS Institute Inc.CONTACT INFORMATIONYour comments and questions are valued and encouraged. Contact the author at:Zhixin Lun, PhD StudentDepartment of Mathematics and Statistics, Oakland a Khattree, Professor/Co-directorDepartment of Mathematics and Statistics/Center for Data Science and Big DataAnalytics, Oakland Universitykhattree@oakland.edu8

SAS and all other SAS Institute Inc. product or service names are registered trademarks ortrademarks of SAS Institute Inc. in the USA and other countries. indicates USAregistration.Other brand and product names are trademarks of their respective companies.9

Multiple Imputation for Skewed Multivariate Data: A Marriage of the MI and COPULA Procedures Zhixin Lun, Ravindra Khattree, Oakland University ABSTRACT Missing data is a common phenomenon in various data analyses. Imputation is a flexible method for handling missing-data problems since it efficiently uses all the available information in the data.