An Introduction To Survival Analysis Using Stata


An Introduction to Survival Analysis Using Stata
Third Edition

MARIO CLEVES
Department of Pediatrics
University of Arkansas Medical Sciences

WILLIAM GOULD
StataCorp

ROBERTO G. GUTIERREZ
StataCorp

YULIA V. MARCHENKO
StataCorp

A Stata Press Publication
StataCorp LP
College Station, Texas

Copyright © 2002, 2004, 2008, 2010 by StataCorp LP
All rights reserved.

First edition 2002
Revised edition 2004
Second edition 2008
Third edition 2010

Published by Stata Press, 4905 Lakeway Drive, College Station, Texas 77845
Typeset in LaTeX 2ε
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN-10: 1-59718-074-2
ISBN-13: 978-1-59718-074-0

No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means (electronic, mechanical, photocopy, recording, or otherwise) without the prior written permission of StataCorp LP.

Stata is a registered trademark of StataCorp LP. LaTeX 2ε is a trademark of the American Mathematical Society.

Contents

List of Tables
List of Figures
Preface to the Third Edition
Preface to the Second Edition
Preface to the Revised Edition
Preface to the First Edition
Notation and Typography

1 The problem of survival analysis
    1.1 Parametric modeling
    1.2 Semiparametric modeling
    1.3 Nonparametric analysis
    1.4 Linking the three approaches

2 Describing the distribution of failure times
    2.1 The survivor and hazard functions
    2.2 The quantile function
    2.3 Interpreting the cumulative hazard and hazard rate
        2.3.1 Interpreting the cumulative hazard
        2.3.2 Interpreting the hazard rate
    2.4 Means and medians

3 Hazard models
    3.1 Parametric models
    3.2 Semiparametric models
    3.3 Analysis time (time at risk)

4 Censoring and truncation
    4.1 Censoring
        4.1.1 Right-censoring
        4.1.2 Interval-censoring
        4.1.3 Left-censoring
    4.2 Truncation
        4.2.1 Left-truncation (delayed entry)
        4.2.2 Interval-truncation (gaps)
        4.2.3 Right-truncation

5 Recording survival data
    5.1 The desired format
    5.2 Other formats
    5.3 Example: Wide-form snapshot data

6 Using stset
    6.1 A short lesson on dates
    6.2 Purposes of the stset command
    6.3 Syntax of the stset command
        6.3.1 Specifying analysis time
        6.3.2 Variables defined by stset
        6.3.3 Specifying what constitutes failure
        6.3.4 Specifying when subjects exit from the analysis
        6.3.5 Specifying when subjects enter the analysis
        6.3.6 Specifying the subject-ID variable
        6.3.7 Specifying the begin-of-span variable
        6.3.8 Convenience options

7 After stset
    7.1 Look at stset’s output
    7.2 List some of your data
    7.3 Use stdescribe
    7.4 Use stvary
    7.5 Perhaps use stfill
    7.6 Example: Hip fracture data

8 Nonparametric analysis
    8.1 Inadequacies of standard univariate methods
    8.2 The Kaplan–Meier estimator
        8.2.1 Calculation
        8.2.2 Censoring
        8.2.3 Left-truncation (delayed entry)
        8.2.4 Interval-truncation (gaps)
        8.2.5 Relationship to the empirical distribution function
        8.2.6 Other uses of sts list
        8.2.7 Graphing the Kaplan–Meier estimate
    8.3 The Nelson–Aalen estimator
    8.4 Estimating the hazard function
    8.5 Estimating mean and median survival times
    8.6 Tests of hypothesis
        8.6.1 The log-rank test
        8.6.2 The Wilcoxon test
        8.6.3 Other tests
        8.6.4 Stratified tests

9 The Cox proportional hazards model
    9.1 Using stcox
        9.1.1 The Cox model has no intercept
        9.1.2 Interpreting coefficients
        9.1.3 The effect of units on coefficients
        9.1.4 Estimating the baseline cumulative hazard and survivor functions
        9.1.5 Estimating the baseline hazard function
        9.1.6 The effect of units on the baseline functions
    9.2 Likelihood calculations
        9.2.1 No tied failures
        9.2.2 Tied failures
            The marginal calculation
            The partial calculation
            The Breslow approximation
            The Efron approximation
        9.2.3 Summary
    9.3 Stratified analysis
        9.3.1 Obtaining coefficient estimates
        9.3.2 Obtaining estimates of baseline functions
    9.4 Cox models with shared frailty
        9.4.1 Parameter estimation
        9.4.2 Obtaining estimates of baseline functions
    9.5 Cox models with survey data
        9.5.1 Declaring survey characteristics
        9.5.2 Fitting a Cox model with survey data
        9.5.3 Some caveats of analyzing survival data from complex survey designs
    9.6 Cox model with missing data: multiple imputation
        9.6.1 Imputing missing values
        9.6.2 Multiple-imputation inference

10 Model building using stcox
    10.1 Indicator variables
    10.2 Categorical variables
    10.3 Continuous variables
        10.3.1 Fractional polynomials
    10.4 Interactions
    10.5 Time-varying variables
        10.5.1 Using stcox, tvc() texp()
        10.5.2 Using stsplit
    10.6 Modeling group effects: fixed-effects, random-effects, stratification, and clustering

11 The Cox model: Diagnostics
    11.1 Testing the proportional-hazards assumption
        11.1.1 Tests based on reestimation
        11.1.2 Test based on Schoenfeld residuals
        11.1.3 Graphical methods
    11.2 Residuals and diagnostic measures
            Reye’s syndrome data
        11.2.1 Determining functional form
        11.2.2 Goodness of fit
        11.2.3 Outliers and influential points

12 Parametric models
    12.1 Motivation
    12.2 Classes of parametric models
        12.2.1 Parametric proportional hazards models
        12.2.2 Accelerated failure-time models
        12.2.3 Comparing the two parameterizations

13 A survey of parametric regression models in Stata
    13.1 The exponential model
        13.1.1 Exponential regression in the PH metric
        13.1.2 Exponential regression in the AFT metric
    13.2 Weibull regression
        13.2.1 Weibull regression in the PH metric
            Fitting null models
        13.2.2 Weibull regression in the AFT metric
    13.3 Gompertz regression (PH metric)
    13.4 Lognormal regression (AFT metric)
    13.5 Loglogistic regression (AFT metric)
    13.6 Generalized gamma regression (AFT metric)
    13.7 Choosing among parametric models
        13.7.1 Nested models
        13.7.2 Nonnested models

14 Postestimation commands for parametric models
    14.1 Use of predict after streg
        14.1.1 Predicting the time of failure
        14.1.2 Predicting the hazard and related functions
        14.1.3 Calculating residuals
    14.2 Using stcurve

15 Generalizing the parametric regression model
    15.1 Using the ancillary() option
    15.2 Stratified models
    15.3 Frailty models
        15.3.1 Unshared frailty models
        15.3.2 Example: Kidney data
        15.3.3 Testing for heterogeneity
        15.3.4 Shared frailty models

16 Power and sample-size determination for survival analysis
    16.1 Estimating sample size
        16.1.1 Multiple-myeloma data
        16.1.2 Comparing two survivor functions nonparametrically
        16.1.3 Comparing two exponential survivor functions
        16.1.4 Cox regression models
    16.2 Accounting for withdrawal and accrual of subjects
        16.2.1 The effect of withdrawal or loss to follow-up
        16.2.2 The effect of accrual
        16.2.3 Examples
    16.3 Estimating power and effect size
    16.4 Tabulating or graphing results

17 Competing risks
    17.1 Cause-specific hazards
    17.2 Cumulative incidence functions
    17.3 Nonparametric analysis
        17.3.1 Breast cancer data
        17.3.2 Cause-specific hazards
        17.3.3 Cumulative incidence functions
    17.4 Semiparametric analysis
        17.4.1 Cause-specific hazards
            Simultaneous regressions for cause-specific hazards
        17.4.2 Cumulative incidence functions
            Using stcrreg
            Using stcox
    17.5 Parametric analysis

References
Author index
Subject index

Preface to the Third Edition

This third edition updates the second edition to reflect the additions to the software made in Stata 11, which was released in July 2009. The updates include syntax and output changes. The two most notable differences here are Stata’s new treatment of factor (categorical) variables and Stata’s new syntax for obtaining predictions and other diagnostics after stcox.

As of Stata 11, the xi: prefix for specifying categorical variables and interactions has been deprecated. Whereas in previous versions of Stata, you might have typed

. xi: stcox i.drug*i.race

to obtain main effects on drug and race and their interaction, in Stata 11 you type

. stcox i.drug##i.race

Furthermore, when you used xi:, Stata created indicator variables in your data that identified the levels of your categorical variables and interactions. As of Stata 11, the calculations are performed intrinsically without generating any additional variables in your data.

Before Stata 11, if you wanted residuals or other diagnostic measures for Cox regression, you had to specify them when you fit your model. For example, to obtain Schoenfeld residuals you might have typed

. stcox age protect, schoenfeld(sch*)

to generate variables sch1 and sch2 containing the Schoenfeld residuals for age and protect, respectively. This has been changed in Stata 11 to be more consistent with Stata’s other estimation commands. The new syntax is

. stcox age protect
. predict sch*, schoenfeld

Chapter 4 has been updated to describe the subtle difference between right-censoring and right-truncation, while previous editions had treated these concepts as synonymous.

Chapter 9 includes an added section on Cox regression that handles missing data with multiple imputation. Stata 11’s new mi suite of commands for imputing missing data and fitting Cox regression on multiply imputed data is described. mi is discussed in the context of stcox, but what is covered there applies to streg and stcrreg (which also is new to Stata 11), as well.

Chapter 11 includes added discussion of three new diagnostic measures after Cox regression. These measures are supported in Stata 11: DFBETA measures of influence, LMAX values, and likelihood displacement values. In previous editions, DFBETAs were discussed, but they required manual calculation.

Chapter 17 is new and describes methods for dealing with competing risks, where competing failure events impede one’s ability to observe the failure event of interest. Discussion focuses on the estimation of cause-specific hazards and of cumulative incidence functions. The new stcrreg command for fitting competing-risks regression models is introduced.

College Station, Texas
July 2010

Mario A. Cleves
William W. Gould
Roberto G. Gutierrez
Yulia V. Marchenko

Preface to the Second Edition

This second edition updates the revised edition (revised to support Stata 8) to reflect Stata 9, which was released in April 2005, and Stata 10, which was released in June 2007. The updates include the syntax and output changes that took place in both versions. For example, as of Stata 9 the estat phtest command replaces the old stphtest command for computing tests and graphs for examining the validity of the proportional-hazards assumption. As of Stata 10, all st commands (as well as other Stata commands) accept option vce(vcetype). The old robust and cluster(varname) options are replaced with vce(robust) and vce(cluster varname). Most output changes are cosmetic. There are slight differences in the results from streg, distribution(gamma), which has been improved to increase speed and accuracy.

Chapter 8 includes a new section on nonparametric estimation of median and mean survival times. Other additions are examples of producing Kaplan–Meier curves with at-risk tables and a short discussion of the use of boundary kernels for hazard function estimation.

Stata’s facility to handle complex survey designs with survival models is described in chapter 9 in application to the Cox model, and what is described there may also be used with parametric survival models.

Chapter 10 is expanded to include more model-building strategies. The use of fractional polynomials in modeling the log relative-hazard is demonstrated in chapter 10. Chapter 11 includes a description of how fractional polynomials can be used in determining functional relationships, and it also includes an example of using concordance measures to evaluate the predictive accuracy of a Cox model.

Chapter 16 is new and introduces power analysis for survival data. It describes Stata’s ability to estimate sample size, power, and effect size for the following survival methods: a two-sample comparison of survivor functions and a test of the effect of a covariate from a Cox model. This chapter also demonstrates ways of obtaining tabular and graphical output of results.

College Station, Texas
March 2008

Mario A. Cleves
William W. Gould
Roberto G. Gutierrez
Yulia V. Marchenko

8 Nonparametric analysis

The previous two chapters served as a tutorial on stset. Once you stset your data, you can use any st survival command, and the nice thing is that you do not have to continually restate the definitions of analysis time, failure, and rules for inclusion.

As previously discussed in chapter 1, the analysis of survival data can take one of three forms (nonparametric, semiparametric, and parametric), all depending on what we are willing to assume about the form of the survivor function and about how the survival experience is affected by covariates.

Nonparametric analysis follows the philosophy of letting the dataset speak for itself and making no assumption about the functional form of the survivor function (and thus no assumption about, for example, the hazard or the cumulative hazard). The effects of covariates are not modeled, either; the comparison of the survival experience is done at a qualitative level across the values of the covariates.

Most of Stata’s nonparametric survival analysis is performed via the sts command, which calculates estimates, saves estimates as data, draws graphs, and performs tests, among other things; see [ST] sts.

8.1 Inadequacies of standard univariate methods

Before we proceed, however, we must discuss briefly the reasons that the typical preliminary data analysis tools do not translate well into the survival analysis paradigm. For example, the most basic of analyses would be one that analyzed the mean time to failure or the median time to failure. Let us use the hip-fracture dataset, which we stset at the end of chapter 7:

. use http://www.stata-press.com/data/cggm3/hip2
(hip fracture study)

. list id _t0 _t fracture protect age calcium if 20<=id & id<=22
  (output omitted)

Putting aside for now the possible effects of the covariates, if we were interested in estimating the population mean time to failure, we might be tempted to use the standard tools such as

. ci _t

    Variable |   Obs        Mean    Std. Err.    [95% Conf. Interval]
-------------+-------------------------------------------------------
          _t |   106     11.5283    .8237498     9.894958    13.16165

We might quickly realize that this is not what we want because there are multiple records for each individual. We could just consider those values of _t corresponding to the last record for each individual,

. sort id _t
. by id: gen last = _n==_N
. ci _t if last

    Variable |   Obs        Mean    Std. Err.    [95% Conf. Interval]
-------------+-------------------------------------------------------
          _t |    48        15.5    1.480368     12.52188    18.47812

and we now have a mean based on 48 observations (one for each subject). This will not serve, however, because _t does not always correspond to failure time: some times in our data are censored, meaning that the failure time in these cases is known only to be greater than _t. As such, the estimate of the mean is biased downward.

Dropping the censored observations and redoing the analysis will not help. Consider an extreme case of a dataset with just one censored observation and assume the observation is censored at time 0.1, long before the first failure. For all you know, had that subject not been censored, the failure might have occurred long after the last failure in the data and thus had a large effect on the mean. Wherever the censored observation is located in the data, we can repeat that argument, and so, in the presence of censoring, obtaining estimates of the mean survival time calculated in the standard way is simply not possible.
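As a rough sketch of the contrast drawn above, the following commands set the naive per-subject mean beside summaries that take censoring into account. It assumes the hip2 dataset is already stset, as the text states. The variable name last mirrors the one created above, and summarize stands in for ci only because it reports the same mean under any Stata version. The stci command and its rmean option belong to Stata itself and are covered properly in section 8.5, so this is a preview under those assumptions rather than part of the argument.

```stata
* Sketch only: naive mean versus survival-aware summaries on the hip2 data
use http://www.stata-press.com/data/cggm3/hip2, clear

* Naive approach: keep the last record per subject and average _t,
* which silently treats censored times as if they were failure times
sort id _t
by id: generate byte last = _n == _N
summarize _t if last

* Survival-aware alternatives that account for censoring
* Median survival time read from the Kaplan-Meier estimate
stci

* Restricted mean: area under the Kaplan-Meier curve up to the last observed time
stci, rmean
```

The restricted mean is still only a lower bound on the true mean whenever the longest observed time is censored, which is the same difficulty the text describes for the naive calculation.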

Estimates of the median survival time are similarly not possible to obtain using standard nonsurvival tools. The standard way of calculating the median is to order the observations and to report the middle one as the median. In the presence of censoring, that ordering is impossible to ascertain. (The modern way of calculating the median is to turn to the calculation of survival probabilities and find the point at which the survival probability is 0.5. See section 8.5.)

Thus even the most simple analysis (never mind the more complicated regression models) will break down when applied to survival data. Also there are even more issues related to survival data (truncation, for example) that would only further complicate the estimation.

Instead, survival analysis is a field of its own. Given the nature of the role that time plays in the analysis, much focus is given to the functions that characterize the distribution of the survival time: the hazard function, the cumulative hazard function, and the survivor function being the most common ways to describe the distribution. Much of survival analysis is concerned with the estimation of and inference for these functions of time.

8.2 The Kaplan–Meier estimator

8.2.1 Calculation

The estimator of Kaplan and Meier (1958) is a nonparametric estimate of the survivor function S(t), which is the probability of survival past time t or, equivalently, the probability of failing after t. For a dataset with observed failure times, $t_1, \ldots, t_k$, where k is the number of distinct failure times observed in the data, the Kaplan–Meier estimate [also known as the product limit estimate of S(t)] at any time t is given by

$$\hat{S}(t) = \prod_{j \mid t_j \le t} \left( \frac{n_j - d_j}{n_j} \right) \tag{8.1}$$

where $n_j$ is the number of individuals at risk at time $t_j$ and $d_j$ is the number of failures at time $t_j$. The product is over all observed failure times less than or equal to t.

How does this estimator work? Consider the hypothetical dataset of subjects given in the usual format,

    id    t    failed
     1    2       1
     2    4       1
     3    4       1
     4    5       0
     5    7       1
     6    8       0

and form a table that summarizes what happens at each time in our data (whether a failure time or a censored time):

     t    No. at risk    No. failed    No. censored
     2         6              1              0
     4         5              2              0
     5         3              0              1
     7         2              1              0
     8         1              0              1

At t = 2, the earliest time in our data, all six subjects were at risk, but at that instant, only one failed (id = 1). At the next time, t = 4, five subjects were at risk, but at that instant, two failed. At t = 5, three subjects were left, and no one failed, but one subject was censored. This left us with two subjects at t = 7, of which one failed. Finally, at t = 8, we had one subject left at risk, and this subject was censored at that time.

Now we ask the following:

• What is the probability of survival beyond t = 2, the earliest time in our data? Because five of the six subjects survived beyond this point, the estimate is 5/6.

• What is the probability of survival beyond t = 4 given survival right up to t = 4? Because we had five subjects at risk at t = 4, and two failed, we estimate this probability to be 3/5.

• What is the probability of survival beyond t = 5 given survival right up to t = 5? Because three subjects were at risk, and no one failed, the probability estimate is 3/3 = 1.

and so on. We can now augment our table with these component probabilities (calling them p):

     t    No. at risk    No. failed    No. censored     p
     2         6              1              0         5/6
     4         5              2              0         3/5
     5         3              0              1          1
     7         2              1              0         1/2
     8         1              0              1          1

• The first value of p, 5/6, is the probability of survival beyond t = 2.

• The second value, 3/5, is the (conditional) probability of survival beyond t = 4 given survival up until t = 4, which in these data is the same as survival beyond t = 4 given survival beyond t = 2. Thus unconditionally, the probability of survival beyond t = 4 is (5/6)(3/5) = 1/2.

• The third value, 1, is the conditional probability of survival beyond t = 5 given survival up until t = 5, which in these data is the same as survival beyond t = 5 given survival beyond t = 4. Unconditionally, the probability of survival beyond t = 5 is thus equal to (1/2)(1) = 1/2.

Thus the Kaplan–Meier estimate is the running product of the values of p that we have previously calculated, and we can add it to our table.

     t    No. at risk    No. failed    No. censored     p      S
     2         6              1              0         5/6    5/6
     4         5              2              0         3/5    1/2
     5         3              0              1          1     1/2
     7         2              1              0         1/2    1/4
     8         1              0              1          1     1/4

Because the Kaplan–Meier estimate in (8.1) operates only on observed failure times (and not at censoring times), the net effect is simply to ignore the cases where p = 1 in calculating our product; ignoring these changes nothing.

In Stata, the Kaplan–Meier estimate is obtained using the sts list command, which gives a table similar to the one we constructed:

. clear
. input id time failed
            id       time     failed
  1. 1 2 1
  2. 2 4 1
  3. 3 4 1
  4. 4 5 0
  5. 5 7 1
  6. 6 8 0
  7. end
. stset time, fail(failed)
  (output omitted)
. sts list

         failure _d:  failed
   analysis time _t:  time

           Beg.          Net       Survivor      Std.
  Time    Total   Fail   Lost      Function     Error     [95% Conf. Int.]
--------------------------------------------------------------------------
     2        6      1      0        0.8333    0.1521     0.2731    0.9747
     4        5      2      0        0.5000    0.2041     0.1109    0.8037
     5        3      0      1        0.5000    0.2041     0.1109    0.8037
     7        2      1      0        0.2500    0.2041     0.0123    0.6459
     8        1      0      1        0.2500    0.2041     0.0123    0.6459
--------------------------------------------------------------------------

The column “Beg. Total” is what we called “No. at risk” in our table; the column “Fail” is “No. failed”; and the column “Net lost” is related to our “No. censored” column but is modified to handle delayed entry (see sec. 8.2.3).
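The arithmetic behind the table can also be checked by hand. The following is a minimal sketch, not the internal algorithm of sts list: it types in the summary table directly and builds each column with running sums. The variable names (n, d, p, cumlnp, S, gw, se, sigma, lb, ub) are chosen here for illustration. The standard error and the confidence bounds use Greenwood's formula (8.2) and the log-log transformation that the text takes up immediately after this example.

```stata
* Hand calculation of the Kaplan-Meier table for the six-subject example.
* One row per distinct time: n = number at risk, d = number failed.
clear
input t n d
2 6 1
4 5 2
5 3 0
7 2 1
8 1 0
end

* Conditional probability of surviving past each time
generate double p = (n - d)/n

* Kaplan-Meier estimate (8.1): running product of p, computed as exp of a running sum of logs
generate double cumlnp = sum(ln(p))
generate double S = exp(cumlnp)

* Greenwood standard error (8.2): S times the root of the running sum of d/{n(n-d)}
generate double gw = sum(d/(n*(n - d)))
generate double se = S*sqrt(gw)

* 95% bounds based on the asymptotic variance of ln{-ln S(t)}
generate double sigma = sqrt(gw)/abs(cumlnp)
generate double lb = S^exp( invnormal(0.975)*sigma)
generate double ub = S^exp(-invnormal(0.975)*sigma)

format p S se lb ub %6.4f
list t n d p S se lb ub, noobs
```

The S column should reproduce the running product 5/6, 1/2, 1/2, 1/4, 1/4 from the table, and se should match the standard errors 0.1521 and 0.2041 that sts list reports. The sketch would return missing values for sigma if no failure had yet occurred (S = 1) and for gw if everyone at risk failed at once, two cases that do not arise in these data.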

The standard error reported for the Kaplan–Meier estimate is that given by Greenwood’s (1926) formula:

$$\widehat{\mathrm{Var}}\{\hat{S}(t)\} = \hat{S}^2(t) \sum_{j \mid t_j \le t} \frac{d_j}{n_j (n_j - d_j)} \tag{8.2}$$

These standard errors, however, are not used for confidence intervals. Instead, the asymptotic variance of $\ln\{-\ln \hat{S}(t)\}$,

$$\hat{\sigma}^2(t) = \frac{\sum \dfrac{d_j}{n_j (n_j - d_j)}}{\left\{ \sum \ln\left( \dfrac{n_j - d_j}{n_j} \right) \right\}^2}$$

is used, where the sums are calculated over j such that $t_j \le t$ (Kalbfleisch and Prentice 2002, 18). The confidence bounds are then calculated as $\hat{S}(t)$ raised to the power $\exp\{\pm z_{\alpha/2}\, \hat{\sigma}(t)\}$, where $z_{\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the standard normal distribution.

8.2.2 Censoring

When censoring occurs at some time other than an observed failure time for a different subject, the effect is simply that the censored subjects are dropped from the “No. at risk” total without processing the censored subject as having failed. However, when some subjects are censored at the same time that others fail, we need to be a bit careful about how we order the censorings and failures. When we went through the calculations of the Kaplan–Meier estimate in section 8.2.1, we did so without explaining this point, yet be assured that we were following some convention.

The Stata convention for handling a censoring that happens at the same time as a failure is to assume that the failure occurred before the censoring, and in fact, all Stata’s st commands follow this rule. In chapter 7, we defined a time span based on the stset variables _t0 and _t to be the interval (_t0, _t], which is open at the left endpoint and closed at the right endpoint. Therefore, if we apply this definition of a time span, then any record shown to be censored at the end of this span can be thought of as instead being censored at some time t + ε for an arbitrarily small ε. The subject can fail at time t, but if the subject is censored, then Stata assumes that the censoring took place just a little bit later; thus failures occur before censorings.

This is how Stata handles this issue, but there is nothing wrong with the convention that handles censorings as occurring before failures when they appear to happen concurrently. One can force Stata to look at things this way by subtracting a small number from the time variable in your data for those records that are censored, and most of the time the number may be chosen small enough as to not otherwise affect the analysis.
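The two conventions can be compared on a small made-up dataset in which one subject fails and another is censored at the same recorded time. This is a sketch under stated assumptions: the four observations and the shift of 0.001 are illustrative choices, not values from the text, and the shift only needs to be smaller than the gap between distinct recorded times.

```stata
* Illustrative data: subjects 2 and 3 share time 4, one failing and one censored
clear
input id time failed
1 2 1
2 4 1
3 4 0
4 7 1
end

* Stata's default convention: the failure at time 4 is processed before the censoring,
* so three subjects are at risk when the failure occurs
stset time, failure(failed)
sts list

* Alternative convention: move censored records slightly earlier so that
* the censoring is processed before the tied failure
replace time = time - 0.001 if failed == 0
stset time, failure(failed)
sts list
```

Under the default convention three subjects are at risk at time 4, so the estimate just after time 4 is (3/4)(2/3) = 1/2. After the shift only two subjects remain at risk when the failure occurs, giving (3/4)(1/2) = 3/8. With larger risk sets the difference between the two conventions is usually much smaller than in this four-subject example.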

8.2.3Left-truncation (d
