Large-scale Parametric Survival Analysis

Transcription

Research Article. Received 21 June 2012; Accepted 18 March 2013. Published online 28 April 2013 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/sim.5817

Large-scale parametric survival analysis

Sushil Mittal,a* David Madigan,a Jerry Q. Chengb and Randall S. Burdc

a Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027, U.S.A.
b Cardiovascular Institute, UMDNJ-Robert Wood Johnson Medical School, 125 Patterson Street, Clinical Academic Building, New Brunswick, NJ 08901, U.S.A.
c Division of Trauma and Burn Surgery, Joseph E. Robert Jr. Center for Surgical Care, Children's National Medical Center, Washington, DC, U.S.A.
* Correspondence to: Sushil Mittal, Department of Statistics, Columbia University, 1255 Amsterdam Avenue, New York, NY 10027, U.S.A. E-mail: mittal@stat.columbia.edu

Survival analysis has been a topic of active statistical research in the past few decades, with applications spread across several areas. Traditional applications usually consider data with only a small number of predictors and a few hundred or a few thousand observations. Recent advances in data acquisition techniques and computational power have led to considerable interest in analyzing very-high-dimensional data where the number of predictor variables and the number of observations range between 10^4 and 10^6. In this paper, we present a tool for performing large-scale regularized parametric survival analysis using a variant of the cyclic coordinate descent method. Through our experiments on two real data sets, we show that application of regularized models to high-dimensional data avoids overfitting and can provide improved predictive performance and calibration over corresponding low-dimensional models. Copyright 2013 John Wiley & Sons, Ltd.

Keywords: survival analysis; parametric models; regularization; penalized regression; pediatric trauma

1. Introduction

Regression analysis of time-to-event data occupies a central role in statistical practice [1, 2], with applications spread across several fields including biostatistics, sociology, economics, demography, and engineering [3–6]. Newer applications often gather high-dimensional data that present computational challenges to existing survival analysis methods. As an example, new technologies in genomics have led to high-dimensional microarray gene expression data where the number of predictor variables is of the order of 10^5. Other large-scale applications include medical adverse event monitoring, longitudinal clinical trials, and business data mining tasks. All of these applications require methods for analyzing high-dimensional data in a survival analysis framework.

In this paper, we consider high-dimensional parametric survival regression models involving both large numbers of predictor variables and large numbers of observations. Although Cox models continue to attract much attention [7], parametric survival models have always been a popular choice among statisticians for analyzing time-to-event data [4, 6, 8, 9]. Parametric survival models feature prominently in commercial statistical software, are straightforward to interpret, and can provide competitive predictive accuracy. To bring all these advantages to high-dimensional data analysis, these methods need to be scaled to data involving 10^4–10^6 predictor variables and even larger numbers of observations.

Computing the maximum likelihood fit of a parametric survival model requires solving a nonlinear optimization problem. Standard implementations work well for small-scale problems. Because these approaches typically require matrix inversion, solving large-scale problems using standard software is typically impossible. Even when standard software does produce estimates, numerical ill conditioning can result in a lack of convergence, large estimated coefficient variances, and poor predictive accuracy or calibration. One possible remedy is to perform feature selection as a preprocessing step. Although feature selection does reduce memory and computational requirements and also serves as a practical solution to overfitting, it introduces new problems. First, the statistical consequences of most feature selection methods remain unclear, making it difficult to choose the number of features for a given task in a principled way.

Second, the most efficient feature selection methods are greedy and may choose redundant or ineffective combinations of features. Finally, it is often unclear how to combine heuristic feature selection methods with domain knowledge.

In this paper, we describe a regularized approach to parametric survival analysis [10]. The main idea is the use of a regularizing prior probability distribution for the model parameters that favors sparseness in the fitted model, leading to point estimates of zero for many of the model parameters. To solve the optimization problem, we use a variation of the cyclic coordinate descent method [11–14]. We show that application of this type of model to high-dimensional data avoids overfitting, can provide improved predictive performance and calibration over corresponding low-dimensional models, and is efficient both during fitting and at prediction time.

In Section 2, we review some of the other work that addresses similar problems. In Section 3, we describe the basics of a regularized approach to parametric survival analysis. In Section 4, we provide a brief overview of the four parametric models used for survival analysis in a high-dimensional formulation. We also describe our optimization technique, which is tailored to each specific model for computing point estimates of the model parameters. More involved algorithmic details of the method for each of the four parametric models can be found in Appendix A. We describe the data sets and methods used in our experiments in Section 5 and provide experimental results in Section 6. The first application uses a large data set of hospitalized injured children to develop a model for predicting survival. Through our experiments, we establish that an analysis using our proposed approach can add significantly to predictive performance compared with traditional low-dimensional models. In the second application, we apply our method to a publicly available breast cancer gene expression data set and show that the high-dimensional parametric model can achieve performance similar to a low-dimensional Cox model while being better calibrated. Finally, we conclude in Section 8 with directions for future work.

We have publicly released the C implementation of our algorithm, which can be downloaded from http://code.google.com/p/survival-analysis-cmake/. This code has been derived from the widely used BBR/BXR software for performing large-scale Bayesian logistic regression (http://www.bayesianregression.org). All four parametric models discussed in this paper are included in the implementation; however, it is fairly straightforward to extend it to other parametric models. The computation of the various evaluation metrics discussed in Section 5 is also integrated into the code.

2. Related work

Survival analysis is an old subject in statistics that continues to attract considerable research attention. Traditionally, survival analysis is concerned with the study of survival times in clinical and health-related studies [4, 15]. However, over the past several decades, survival analysis has found an array of applications ranging from reliability studies in industrial engineering to analyses of times between child births in demography and sociology [3].
Other application areas have also benefited from the use of these methods [6, 8].

Earlier applications usually consisted of a relatively small number of predictors (usually less than 20) with a few hundred or sometimes a few thousand examples. Recently, there has been considerable interest in analyzing high-dimensional time-to-event problems. For instance, a large body of work has focused on methodologies to handle the overwhelming amount of data generated by new technologies in biology, such as gene expression microarrays and single nucleotide polymorphism data. The goal of the work in high-dimensional survival analysis has been both to develop new statistical methods [16–21] and to extend existing methods to handle new data sets [13, 22–24]. For example, recent work [16, 18] has extended traditional support vector machines for regression to survival analysis by additionally penalizing discordant pairs of observations. Another method [19, 20] extends the use of random forests for variable selection in survival analysis. Similarly, other methods [22, 25] have used an elastic net approach for variable selection both under the Cox proportional hazards model and under an accelerated failure time model. This approach is similar to another work [24] that applies an efficient method to compute L1-penalized parameter estimates for Cox models. A recent review [26] provides a survey of the existing methods for variable selection and model estimation for high-dimensional data.

Some more recent tools such as coxnet [27] and fastcox [25] adopt optimization approaches that can scale to the high-dimensional, high-sample-size data that we focus on. Although other models provided by the R package glmnet do support sparse formats, neither coxnet nor fastcox currently supports a sparse matrix format for the input data. However, both coxnet and fastcox provide estimates for the Cox proportional hazards model, and fastcox supports elastic net regularization.

3. Regularized survival analysis

Denote by $n$ the number of individuals in the training data. We represent their survival times by $y_i = \min(t_i, c_i)$, $i = 1, \ldots, n$, where $t_i$ and $c_i$ are the time to event (failure time) and right-censoring time for each individual, respectively. Let $\delta_i = I(t_i \leq c_i)$ be the indicator variable such that $\delta_i$ is 1 if the observation is not censored and 0 otherwise. Further, let $x_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]$ be a $p$-vector of covariates. We assume that $t_i$ and $c_i$ are conditionally independent given $x_i$ and that the censoring mechanism is noninformative. The observed data comprise triplets $D = \{(y_i, \delta_i, x_i) : i = 1, \ldots, n\}$.

Let $\theta$ be the set of unknown, underlying model parameters. We assume that the survival times $y_1, y_2, \ldots, y_n$ arise in an independent and identically distributed fashion from density and survival functions $f(y \mid \theta)$ and $S(y \mid \theta)$, respectively, parametrized by $\theta$. We are interested in the likelihood $L(\theta \mid D)$ of the parametric model, where

    L(\theta \mid D) = \prod_{i=1}^{n} f(y_i \mid \theta)^{\delta_i} \, S(y_i \mid \theta)^{1 - \delta_i}.    (1)

We analyze and compare the performance of four different parametric models by modeling the distributions of the survival times using exponential, Weibull, log-logistic, or lognormal distributions. Each of these distributions can be fully parametrized by the parameter pair $\theta = (\lambda, \sigma)$. Typically, the parameter $\lambda$ is reparametrized in terms of the covariates $x = [x_1, x_2, \ldots, x_p]$ and the vector $\beta = [\beta_1, \beta_2, \ldots, \beta_p]$ such that $\lambda_i = \psi(\beta^\top x_i)$, $i = 1, \ldots, n$. In general, the mapping function $\psi(\cdot)$ is different for each model, and standard choices exist. The likelihood function of (1) in terms of the new parameters can be written as

    L(\beta, \sigma \mid D) = \prod_{i=1}^{n} f(y_i \mid \lambda_i, \sigma)^{\delta_i} \, S(y_i \mid \lambda_i, \sigma)^{1 - \delta_i}.    (2)

The parameters $\beta$ and $\sigma$ are estimated by maximizing their joint posterior density

    L(\beta, \sigma) \propto L(\beta, \sigma \mid D) \, \pi(\beta) \, \pi(\sigma).    (3)

The joint posterior distribution of $(\beta, \sigma)$ does not usually have a closed-form solution, but it can be shown that the conditional posterior distributions $\pi(\beta \mid \sigma, D)$ and $\pi(\sigma \mid \beta, D)$ are concave and can therefore be maximized efficiently. In practice, it is sufficient to estimate $\beta$ and $\sigma$ by maximizing the conditional posterior of just $\beta$,

    L(\beta) \propto L(\beta, \sigma \mid D) \, \pi(\beta).    (4)

For a model to generalize well to unseen test data, it is important to avoid overfitting the training data. In the Bayesian paradigm, this goal can be achieved by specifying an appropriate prior distribution on $\beta$ such that each $\beta_j$ is likely to be near 0. As we will observe in the following sections, both Gaussian and Laplacian priors fall under this category. Because we focus on posterior mode estimation in this paper, one can view our procedure as Bayesian or simply as a form of regularization or penalization. We use the Bayesian terminology in part because we view fully Bayesian computation as the next desirable step in large-scale survival analysis. We return to this point in the conclusion section.

3.1. Gaussian priors and ridge regression

For L2 regularization, we assume a Gaussian prior for each $\beta_j$ with mean 0 and variance $\tau_j$, that is,

    \pi(\beta_j \mid \tau_j) = N(0, \tau_j) = \frac{1}{\sqrt{2\pi\tau_j}} \exp\left(-\frac{\beta_j^2}{2\tau_j}\right).    (5)

The mean of 0 encodes a prior preference for values of $\beta_j$ that are close to 0. The variances $\tau_j$ are positive constants that control the degree of regularization and are typically chosen through cross-validation. Smaller values of $\tau_j$ imply a stronger belief that $\beta_j$ is close to 0, whereas larger values impose a less informative prior. In the simplest case, we assume that $\tau_1 = \tau_2 = \ldots = \tau_p$. Assuming that the components of $\beta$ are independent a priori, the overall prior for $\beta$ can be expressed as the product of the priors of each individual $\beta_j$, that is, $\pi(\beta \mid \tau_1, \ldots, \tau_p) = \prod_{j=1}^{p} \pi(\beta_j \mid \tau_j)$. Finding the maximum a posteriori estimate of $\beta$ with this prior is equivalent to performing a ridge regression [28]. Note that although the Gaussian prior favors values of $\beta_j$ close to 0, posterior modes are generally not exactly equal to 0 for any $\beta_j$.
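To make the objective concrete, the following is a minimal sketch, in C, of the negated log-posterior of (4) for one of the four models, a Weibull accelerated failure time model, combined with the Gaussian prior of (5). The Weibull parametrization used here (log y_i = beta'x_i + sigma*W with W standard extreme value), the dense row-major data layout, and all variable names are illustrative assumptions rather than the formulation used in the released software.

```c
/*
 * Sketch only: negated log-posterior of equation (4) for a Weibull AFT model
 * under the Gaussian prior of equation (5).  Not the released implementation.
 */
#include <math.h>

/* y[i]     : observed time (event or right-censoring time)
 * delta[i] : 1 if the event was observed, 0 if censored
 * x        : n-by-p design matrix, row-major
 * beta     : p regression coefficients, sigma : scale parameter
 * tau      : common prior variance tau_j of equation (5)                    */
double neg_log_posterior(const double *y, const int *delta, const double *x,
                         int n, int p, const double *beta, double sigma,
                         double tau)
{
    double nlp = 0.0;
    for (int i = 0; i < n; i++) {
        double eta = 0.0;                       /* linear predictor beta'x_i  */
        for (int j = 0; j < p; j++)
            eta += beta[j] * x[i * p + j];
        double z = (log(y[i]) - eta) / sigma;   /* standardized log time      */
        if (delta[i])                           /* event: add -log f(y_i)     */
            nlp -= -log(sigma) - log(y[i]) + z - exp(z);
        else                                    /* censored: add -log S(y_i)  */
            nlp += exp(z);
    }
    for (int j = 0; j < p; j++)                 /* Gaussian (ridge) penalty   */
        nlp += beta[j] * beta[j] / (2.0 * tau);
    return nlp;
}
```

Under the Laplacian prior of Section 3.2, the quadratic penalty in the last loop would be replaced by a term proportional to the sum of the absolute values of the coefficients, which is what produces exact zeros at the posterior mode.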

3.2. Laplacian prior and lasso regression

For L1 regularization, we again assume that each $\beta_j$ follows a Gaussian distribution with mean 0 and variance $\tau_j$. However, instead of fixed $\tau_j$'s, we assume that each $\tau_j$ arises from an exponential distribution parametrized by $\lambda_j$ and having density

    \pi(\tau_j \mid \lambda_j) = \frac{\lambda_j^2}{2} \exp\left(-\frac{\lambda_j^2 \tau_j}{2}\right).    (6)

Integrating out $\tau_j$ gives an equivalent nonhierarchical, double-exponential (Laplace) distribution with density

    \pi(\beta_j \mid \lambda_j) = \frac{\lambda_j}{2} \exp(-\lambda_j |\beta_j|).    (7)

Again, in the simplest case, we assume that $\lambda_1 = \lambda_2 = \ldots = \lambda_p$. Similar to the Gaussian prior, assuming that the components of $\beta$ are independent, $\pi(\beta \mid \lambda_1, \ldots, \lambda_p) = \prod_{j=1}^{p} \pi(\beta_j \mid \lambda_j)$. Finding the maximum a posteriori estimate of $\beta$ with the Laplacian prior is equivalent to performing a lasso regression [29]. With this approach, a sparse solution typically ensues, meaning the posterior mode for many components of the $\beta$ vector will be 0.

4. Parametric models

It can be shown that for both Gaussian and Laplacian regularization, the negated log-posterior of (4) for all four parametric models considered in our work is convex. Therefore, a wide range of optimization algorithms can be used. Because of the high dimensionality of our target applications, the usual methods such as Newton–Raphson cannot be used because of their high memory requirements. Many alternative optimization approaches have been proposed for maximum a posteriori estimation in high-dimensional regression problems [30, 31]. We use the Combined Local and Global (CLG) algorithm of [30], a type of cyclic coordinate descent algorithm, because of its favorable scaling to high-dimensional data and its ease of implementation. This method was successfully adapted to the lasso [14] for performing large-scale logistic regression and implemented in the widely used BBR/BXR software (http://www.bayesianregression.org).

A cyclic coordinate descent algorithm begins by setting all $p$ parameters $\beta_j$, $j = 1, \ldots, p$, to some initial value. It then sets the first parameter to a value that minimizes the objective function, holding all other parameters constant; this is a one-dimensional optimization. The algorithm then finds the minimizing value of a second parameter, while holding all other values constant (including the new value of the first parameter). The third parameter is then optimized, and so on. When all variables have been traversed, the algorithm returns to the first parameter and starts again. Multiple passes are made over the parameters until some convergence criterion is met. We note that, similar to previous work [14], instead of iteratively updating each parameter until convergence, we update it only once before proceeding to the next parameter. Because the optimal values of the other parameters are themselves changing, tuning a particular parameter to very high precision in each pass of the algorithm is not necessary. For more details of the CLG method, we refer readers to the relevant publications in this area [14, 30]. The details of our algorithm for the different parametric models are described in Appendix A. We also note that for the case of the Laplacian prior, the derivative of the negated log-posterior is undefined at $\beta_j = 0$, $j = 1, \ldots, p$. Section 4.3 of [14] describes the modification of the CLG algorithm that we utilized to address this issue.
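As a rough illustration of how a single pass of such an algorithm is structured, the sketch below applies a trust-region-clipped one-dimensional Newton update to each coordinate in turn. The callback interface, the specific trust-region update rule, and all names are simplifying assumptions; the sketch does not reproduce the model-specific derivative computations of Appendix A or the Laplacian-prior handling from Section 4.3 of [14].

```c
/*
 * Simplified sketch of one pass of cyclic coordinate descent in the spirit of
 * CLG: each beta_j receives a single trust-region-clipped Newton update while
 * the other coordinates are held fixed.
 */
#include <math.h>

typedef struct {
    double (*d1)(int j, const double *beta, void *data); /* first derivative of  */
    double (*d2)(int j, const double *beta, void *data); /* the negated log-     */
    void *data;                                          /* posterior in beta_j  */
} objective_t;

void ccd_one_pass(const objective_t *obj, double *beta, int p, double *width)
{
    for (int j = 0; j < p; j++) {
        double g = obj->d1(j, beta, obj->data);
        double h = obj->d2(j, beta, obj->data);     /* positive for a convex fit */
        double step = (h > 0.0) ? -g / h : 0.0;     /* tentative Newton step     */
        if (step >  width[j]) step =  width[j];     /* clip to the trust region  */
        if (step < -width[j]) step = -width[j];
        beta[j] += step;
        /* grow or shrink the per-coordinate trust region for the next pass */
        double a = fabs(step);
        width[j] = (2.0 * a > width[j] / 2.0) ? 2.0 * a : width[j] / 2.0;
    }
}
```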

5. Experiments

We tested our algorithm for large-scale parametric survival analysis on two different data sets. In the following, we briefly describe the data sets and motivate their choice for our work. In both cases, we compare the performance of the high-dimensional parametric model trained on all predictors with that of a low-dimensional model trained on a small subset of predictors.

5.1. Pediatric trauma data

Trauma is the leading cause of death and acquired disability in children and adolescents. Injuries result in more deaths in children than all other causes combined [32]. To provide optimal care, children at high risk for mortality need to be identified and triaged to centers with the resources to manage these patients. The overall goal of our analysis is to develop a model for predicting mortality after pediatric injury. For prediction within the domain of trauma, the literature has traditionally focused on low-dimensional analysis, that is, modeling using a small set of features from the injury scene or emergency department [33, 34]. Although approaches that use low-dimensional data to predict outcome may be easier to implement, the trade-off may be poorer predictive performance.

We obtained our data set from the National Trauma Data Bank (NTDB), a trauma database maintained by the American College of Surgeons. The data set includes 210,555 patient records of injured children aged 15 years and younger collected over 5 years (2006–2010). We divided these data into a training data set (153,402 patients for years 2006–2009) and a testing data set (57,153 patients for year 2010). The mortality rate of the training set is 1.68%, whereas that of the test set is 1.44%. There are a total of 125,952 binary predictors indicating the presence or absence of a particular attribute (or interaction among various attributes). The high-dimensional model was trained using all 125,952 predictors, whereas the low-dimensional model used only 41 predictors. The information about the various predictors used for both high-dimensional and low-dimensional models is summarized in Table I.

5.2. Breast cancer gene expression data

For our second application, we analyzed the well-known breast cancer gene expression data set [35, 36]. This data set is publicly available and consists of cDNA expression profiles of 295 tumor samples from patients with breast cancer. These patients were diagnosed with breast cancer between 1984 and 1995 at the Netherlands Cancer Institute and were aged 52 years or younger at the time of diagnosis. Overall, 79 (26.78%) patients died during the follow-up time, and the remaining 216 are censored. The total number of predictors (number of genes) is 24,885, each of which represents the log ratio of the intensities of the two color dyes used for a specific gene. The high-dimensional model was trained using all 24,885 predictors. For the low-dimensional model, we used the glmpath (coxpath) package of R [37] and generated the entire regularized path, outputting predictors in the order of their relative importance.
Table I. Description of the predictors used for high-dimensional and low-dimensional models for pediatric trauma data.

Main effects
  ICD-9 codes (1,890 predictors; high-dim): International Classification of Disease, Ninth Revision.
  AIS codes, predots (349 predictors; high-dim): Abbreviated Injury Scale codes that include body region, anatomic structure associated with the injury, and the level of injury.
Interactions/combinations
  ICD-9, ICD-9 (102,284 predictors; high-dim): co-occurrences of two ICD-9 injury codes.
  AIS code, AIS code (20,809 predictors; high-dim): co-occurrences of two AIS codes.
  Body region, AIS score (41 predictors; high-dim and low-dim): combinations of any of the nine body regions associated with an injury with an injury severity score (between 1 and 6) determined according to the AIS coding scheme.
  [Body region, AIS score], [Body region, AIS score] (579 predictors; high-dim): co-occurrences of two [body region, AIS score] combinations.
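The interaction predictors in Table I are binary co-occurrence indicators. The following sketch shows one way such pairwise features could be generated from a patient's list of code indices; the pair-indexing scheme, the dense output row, and all names are hypothetical and are not taken from the paper or the released code.

```c
/*
 * Hypothetical sketch: building binary co-occurrence predictors of the kind
 * listed in Table I from one patient's sorted list of code indices.
 */

/* Map an unordered pair (a, b), 0 <= a < b < m, to a unique interaction column. */
static int pair_index(int a, int b, int m)
{
    return a * m - a * (a + 1) / 2 + (b - a - 1);
}

/* codes : k distinct, sorted code indices present for one patient
 * row   : zero-initialized indicator vector of length m + m*(m-1)/2            */
void build_predictor_row(const int *codes, int k, int m, unsigned char *row)
{
    for (int a = 0; a < k; a++) {
        row[codes[a]] = 1;                                /* main-effect term   */
        for (int b = a + 1; b < k; b++)                   /* co-occurrence term */
            row[m + pair_index(codes[a], codes[b], m)] = 1;
    }
}
```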

We picked the top five predictors from the list and trained a low-dimensional model for comparison. The data were then randomly split into training (67%) and testing (33%) sets such that the mortality in both parts was equal to the original mortality rate.

5.3. Hyperparameter selection

The Gaussian and Laplacian priors both require a prior variance, $\sigma_j^2$, $j = 1, \ldots, p$, for the parameter values. The actual hyperparameters are $\tau_j = \sigma_j^2$ for the Gaussian prior and $\lambda_j = \sqrt{2}/\sigma_j$ for the Laplacian prior. For both applications, the regularization parameter was selected using fourfold cross-validation on the training data. The variance $\sigma_j^2$ was varied between $10^{-5}$ and $10^{6}$ by multiples of 10. For the Gaussian prior, this amounts to varying the actual regularization parameter between $10^{-5}$ and $10^{6}$ by multiples of 10, whereas for the Laplacian prior it was varied between 0.0014 and 447.21 in multiples of $\sqrt{10}$. For each choice of the hyperparameter, we computed the sum of log-likelihoods for the patients in all four validation sets and chose the hyperparameter value that maximized this sum.

5.4. Performance evaluation

There are several ways to compare the performance of fitted parametric models. These evaluation metrics can be divided into two categories: ones that assess the discrimination quality of the model and others that assess its calibration. As for any other regression model, the log-likelihood on the test data is an obvious choice for measuring the performance of penalized parametric survival regression. In the past several years, survival analysis-specific metrics that take censoring into account have been proposed. Among these metrics, the area under the ROC curve (AUC), also known as Harrell's c-statistic [38, 39], a measure of the discriminative accuracy of the model, has become one of the important criteria used to evaluate the performance of survival models [40]. Motivated by a similar test for logistic regression proposed by Hosmer and Lemeshow, we evaluated calibration using an overall goodness-of-fit test that has been proposed for survival data [41, 42]. Although the original test was proposed for Cox models, its use for parametric models has also been described [6, Chapter 8]. In the following, we briefly describe these two metrics.

5.4.1. Harrell's c-statistic. Harrell's c-statistic is an extension of the traditional area under the ROC curve statistic but is better suited to time-to-event data because it is independent of any thresholding process. The method is based on the comparison of estimated and ground-truth orderings of risk between pairs of comparable subjects. Two subjects are said to be comparable if at least one of the subjects in the pair has developed the event (e.g., death) and if the follow-up time of that subject is less than that of the other. Using the notation of Section 3, the comparability of an ordered pair of subjects $(i, j)$ can be represented by the indicator variable $\nu_{ij} = I(y_i < y_j, \delta_i = 1)$, such that $\nu_{ij}$ is 1 when the two subjects are comparable and 0 otherwise. The total number of comparable pairs in the test data containing $n$ subjects can then be computed as

    N_\nu = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \nu_{ij}.    (8)

To measure the predicted discrimination, the concordance of the comparable pairs is then estimated. A pair of comparable subjects (as defined earlier) is said to be concordant if the estimated risk of the subject who developed the event earlier is greater than that of the other subject. Therefore, the concordance of the ordered subject pair $(i, j)$ can be represented using the indicator variable $\kappa_{ij} = I(\nu_{ij} = 1, r_i > r_j)$, where $r_i = \beta^\top x_i$ and $r_j = \beta^\top x_j$ are the relative risk scores of the $i$th and $j$th subjects; $\kappa_{ij}$ is 1 when the two subjects are concordant and 0 otherwise. The total number of concordant pairs can then be written as

    N_\kappa = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \kappa_{ij}.    (9)

Finally, the c-statistic is given by

    c\text{-statistic} = \frac{N_\kappa}{N_\nu}.    (10)
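A minimal sketch of computing the c-statistic of equations (8)–(10) from observed times, censoring indicators, and relative risk scores is given below; tie handling is omitted for brevity, and the variable names are our own.

```c
/*
 * Sketch of the c-statistic of equations (8)-(10).  Ties in time and in risk
 * score are ignored; names are illustrative.
 */
double c_statistic(const double *y, const int *delta, const double *r, int n)
{
    long comparable = 0, concordant = 0;
    for (int i = 0; i < n; i++) {
        if (!delta[i]) continue;                    /* subject i must be an event */
        for (int j = 0; j < n; j++) {
            if (j == i || !(y[i] < y[j])) continue; /* nu_ij requires y_i < y_j   */
            comparable++;                           /* contributes to N_nu (8)    */
            if (r[i] > r[j]) concordant++;          /* contributes to N_kappa (9) */
        }
    }
    return comparable > 0 ? (double)concordant / (double)comparable : 0.0;
}
```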

5.4.2. Hosmer–Lemeshow statistic. To assess the overall goodness of fit using this test, the subjects are first sorted in order of their relative risk score ($r_i = \beta^\top x_i$, $i = 1, \ldots, n$) and divided into $G$ equal-sized groups. The observed number of events $o_g$ in the $g$th group is obtained by summing the number of noncensored observations in that group, whereas the number of expected events $e_g$ is computed by summing the cumulative hazard of all the subjects in the same group. The $\chi^2$ statistic for the overall goodness of fit is then given by

    \chi^2 = \sum_{g=1}^{G} \frac{(o_g - e_g)^2}{e_g}.    (11)

Table II. Comparison of low-dimensional and high-dimensional models for pediatric trauma data with Gaussian penalization. [Table values not legible in the transcription.] The number of selected predictors refers to the number of significant predictors ($|\beta_j| \geq 10^{-4}$). Bold emphases represent superior performance of one model over the other.

Table III. Comparison of low-dimensional and high-dimensional models for pediatric trauma data with Laplacian penalization. [Table values not legible in the transcription.] Bold emphases represent superior performance of one model over the other.
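A minimal sketch of the computation behind equation (11) is shown below. The subject structure, the equal-sized grouping by integer arithmetic, and the assumption that each subject's cumulative hazard has been evaluated elsewhere (in a model-specific way) are all illustrative choices.

```c
/*
 * Sketch of the goodness-of-fit statistic of equation (11).  The cumulative
 * hazard H must be computed elsewhere for the fitted parametric model.
 */
#include <stdlib.h>

typedef struct { double r; int delta; double H; } subject_t;

static int by_risk(const void *a, const void *b)
{
    double ra = ((const subject_t *)a)->r, rb = ((const subject_t *)b)->r;
    return (ra > rb) - (ra < rb);
}

double hosmer_lemeshow_chi2(subject_t *s, int n, int G)
{
    qsort(s, n, sizeof(subject_t), by_risk);        /* sort by risk score r_i */
    double chi2 = 0.0;
    for (int g = 0; g < G; g++) {
        int lo = g * n / G, hi = (g + 1) * n / G;   /* G equal-sized groups   */
        double o = 0.0, e = 0.0;
        for (int i = lo; i < hi; i++) {
            o += s[i].delta;                        /* observed events o_g    */
            e += s[i].H;                            /* expected events e_g    */
        }
        if (e > 0.0) chi2 += (o - e) * (o - e) / e;
    }
    return chi2;
}
```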

6. Results

We now compare the performance of the low-dimensional and high-dimensional parametric models on both data sets. For both applications, the hyperparameter $\sigma^2$ was selected by performing fourfold cross-validation on the training data.

6.1. Pediatric trauma data

Tables II and III summarize the results for the pediatric trauma data set for low-dimensional and high-dimensional models using all four parametric models with Gaussian and Laplacian penalization. Note that under the L2 prior, the estimate for any $\beta_j$ is never exactly 0, and thus all variables contribute to the final model. The number of selected predictors in Table II refers only to the number of significant predictors ($|\beta_j| \geq 10^{-4}$). The Hosmer–Lemeshow $\chi^2$ index was computed using $G = 50$. Both discriminative measures (log-likelihood and c-statistic) are always significantly better for high-dimensional models than for the corresponding low-dimensional models. In many cases, the high-dimensional model is also better calibrated.

Table IV. Comparison of the number of events in low-risk, medium-risk, and high-risk groups for various low-dimensional and high-dimensional models using Gaussian penalization for pediatric trauma data. [Table values not legible in the transcription.] MST, mean observed survival time in days.

Table V. Comparison of the number of events in low-risk, medium-risk, and high-risk groups for various low-dimensional and high-dimensional models using Laplacian penalization for pediatric trauma data. [Table values not legible in the transcription.] MST, mean observed survival time in days.
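Tables IV and V report events and mean observed survival times within low-, medium-, and high-risk groups. A minimal sketch of producing such a summary is shown below; splitting the ranked subjects into equal-sized thirds is purely an illustrative assumption, since the grouping rule used for the tables is not described here.

```c
/*
 * Rough sketch of the kind of summary shown in Tables IV and V: subjects are
 * ranked by relative risk score and summarized in three groups.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double r; int delta; double y; } ranked_t;

static int cmp_risk(const void *a, const void *b)
{
    double ra = ((const ranked_t *)a)->r, rb = ((const ranked_t *)b)->r;
    return (ra > rb) - (ra < rb);
}

void risk_group_summary(ranked_t *s, int n)
{
    qsort(s, n, sizeof(ranked_t), cmp_risk);            /* lowest risk first  */
    const char *label[3] = { "Low", "Medium", "High" };
    for (int g = 0; g < 3; g++) {
        int lo = g * n / 3, hi = (g + 1) * n / 3, events = 0;
        double tsum = 0.0;
        for (int i = lo; i < hi; i++) {
            events += s[i].delta;                       /* observed events    */
            tsum   += s[i].y;                           /* observed time      */
        }
        if (hi > lo)
            printf("%s risk: %d subjects, %d events, mean observed time %.2f\n",
                   label[g], hi - lo, events, tsum / (hi - lo));
    }
}
```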

Table VI. Comparison of low-dimensional and high-dimensional models for gene expression data with Gaussian penalization. [Table values not legible in the transcription.] The number of selected predictors refers to the number of significant predictors ($|\beta_j| \geq 10^{-4}$). Bold emphases represent superior performance of one model over the other.
