
Biostatistics
A Methodology for the Health Sciences
Second Edition

GERALD VAN BELLE
LLOYD D. FISHER
PATRICK J. HEAGERTY
THOMAS LUMLEY

Department of Biostatistics and
Department of Environmental and Occupational Health Sciences
University of Washington
Seattle, Washington

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright 2004 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400, fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:

Biostatistics: a methodology for the health sciences / Gerald van Belle . . . [et al.] – 2nd ed.
p. cm. – (Wiley series in probability and statistics)
First ed. published in 1993, entered under Fisher, Lloyd.
Includes bibliographical references and index.
ISBN 0-471-03185-2 (cloth)
1. Biometry. I. Van Belle, Gerald. II. Fisher, Lloyd, 1939– Biostatistics. III. Series.
QH323.5.B562 2004
610′.1′5195–dc22
2004040491

Printed in the United States of America.
10 9 8 7 6 5 4 3 2 1

Ad majorem Dei gloriam

Contents

Preface to the First Edition
Preface to the Second Edition
1. Introduction to Biostatistics
2. Biostatistical Design of Medical Studies
3. Descriptive Statistics
4. Statistical Inference: Populations and Samples
5. One- and Two-Sample Inference
6. Counting Data
7. Categorical Data: Contingency Tables
8. Nonparametric, Distribution-Free, and Permutation Models: Robust Procedures
9. Association and Prediction: Linear Models with One Predictor Variable
10. Analysis of Variance
11. Association and Prediction: Multiple Regression Analysis and Linear Models with Multiple Predictor Variables
12. Multiple Comparisons
13. Discrimination and Classification
14. Principal Component Analysis and Factor Analysis
15. Rates and Proportions
16. Analysis of the Time to an Event: Survival Analysis
17. Sample Sizes for Observational Studies
18. Longitudinal Data Analysis
19. Randomized Clinical Trials
20. Personal Postscript
Appendix
Author Index
Subject Index
Symbol Index

Preface to the First Edition

The purpose of this book is for readers to learn how to apply statistical methods to the biomedical sciences. The book is written so that those with no prior training in statistics and a mathematical knowledge through algebra can follow the text, although the more mathematical training one has, the easier the learning. The book is written for people in a wide variety of biomedical fields, including (alphabetically) biologists, biostatisticians, dentists, epidemiologists, health services researchers, health administrators, nurses, and physicians. The text appears to have a daunting amount of material. Indeed, there is a great deal of material, but most students will not cover it all. Also, over 30% of the text is devoted to notes, problems, and references, so that there is not as much material as there seems to be at first sight. In addition to not covering entire chapters, the following are optional materials: asterisks (*) preceding a section number or problem denote more advanced material that the instructor may want to skip; the notes at the end of each chapter contain material for extending and enriching the primary material of the chapter, but this may be skipped.

Although the order of authorship may appear alphabetical, in fact it is random (we tossed a fair coin to determine the sequence) and the book is an equal collaborative effort of the authors. We have many people to thank. Our families have been helpful and long-suffering during the writing of the book: for LF, Ginny, Brad, and Laura; for GvB, Johanna, Loeske, William John, Gerard, Christine, Louis, and Bud and Stacy. The many students who were taught with various versions of portions of this material were very helpful. We are also grateful to the many collaborating investigators, who taught us much about science as well as the joys of collaborative research. Among those deserving thanks are for LF: Ed Alderman, Christer Allgulander, Fred Applebaum, Michele Battie, Tom Bigger, Stan Bigos, Jeff Borer, Martial Bourassa, Raleigh Bowden, Bob Bruce, Bernie Chaitman, Reg Clift, Rollie Dickson, Kris Doney, Eric Foster, Bob Frye, Bernard Gersh, Karl Hammermeister, Dave Holmes, Mel Judkins, George Kaiser, Ward Kennedy, Tom Killip, Ray Lipicky, Paul Martin, George McDonald, Joel Meyers, Bill Myers, Michael Mock, Gene Passamani, Don Peterson, Bill Rogers, Tom Ryan, Jean Sanders, Lester Sauvage, Rainer Storb, Keith Sullivan, Bob Temple, Don Thomas, Don Weiner, Bob Witherspoon, and a large number of others. For GvB: Ralph Bradley, Richard Cornell, Polly Feigl, Pat Friel, Al Heyman, Myles Hollander, Jim Hughes, Dave Kalman, Jane Koenig, Tom Koepsell, Bud Kukull, Eric Larson, Will Longstreth, Dave Luthy, Lorene Nelson, Don Martin, Duane Meeter, Gil Omenn, Don Peterson, Gordon Pledger, Richard Savage, Kirk Shy, Nancy Temkin, and many others.

In addition, GvB acknowledges the secretarial and moral support of Sue Goleeke. There were many excellent and able typists over the years; special thanks to Myrna Kramer, Pat Coley, and Jan Alcorn. We owe special thanks to Amy Plummer for superb work in tracking down authors and publishers for permission to cite their work. We thank Robert Fisher for help with numerous figures. Rob Christ did an excellent job of using LaTeX for the final version of the text. Finally, several people assisted with running particular examples and creating the tables; we thank Barry Storer, Margie Jones, and Gary Schoch.

Our initial contact with Wiley was the indefatigable Beatrice Shube. Her enthusiasm for our effort carried over to her successor, Kate Roach. The associate managing editor, Rose Ann Campise, was of great help during the final preparation of this manuscript.

With a work this size there are bound to be some errors, inaccuracies, and ambiguous statements. We would appreciate receiving your comments. We have set up a special electronic mail account for your feedback:

http://www.biostat-text.info

Lloyd D. Fisher
Gerald van Belle

Preface to the Second Edition

Biostatistics did not spring fully formed from the brow of R. A. Fisher, but evolved over many years. This process is continuing, although it may not be obvious from the outside. It has been ten years since the first edition of this book appeared (and rather longer since it was begun). Over this time, new areas of biostatistics have been developed and emphases and interpretations have changed.

The original authors, faced with the daunting task of updating a 1000-page text, decided to invite two colleagues to take the lead in this task. These colleagues, experts in longitudinal data analysis, survival analysis, computing, and all things modern and statistical, have given a twenty-first-century thrust to the book.

The author sequence for the first edition was determined by the toss of a coin (see the Preface to the First Edition). For the second edition it was decided to switch the sequence of the first two authors and add the new authors in alphabetical sequence.

This second edition adds a chapter on randomized trials and another on longitudinal data analysis. Substantial changes have been made in discussing robust statistics, model building, survival analysis, and discrimination. Notes have been added throughout, and many graphs redrawn. We have tried to eliminate errata found in the first edition, and while more have undoubtedly been added, we hope there has been a net improvement. When you find mistakes we would appreciate hearing about them at http://www.vanbelle.org/biostatistics/.

Another major change over the past decade or so has been technological. Statistical software and the computers to run it have become much more widely available (many of the graphs and new analyses in this book were produced on a laptop that weighs only slightly more than a copy of the first edition), and the Internet provides ready access to information that used to be available only in university libraries. In order to accommodate the new sections and to attempt to keep up with future changes, we have shifted some material to a set of Web appendices. These may be found at http://www.biostat-text.info. The Web appendices include notes, data sets and sample analyses, links to other online resources, all but a bare minimum of the statistical tables from the first edition, and other material for which ink on paper is a less suitable medium.

These advances in technology have not solved the problem of deadlines, and we would particularly like to thank Steve Quigley at Wiley for his equanimity in the face of schedule slippage.

Gerald van Belle
Lloyd Fisher
Patrick Heagerty
Thomas Lumley
Seattle, June 15, 2003

CHAPTER 1
Introduction to Biostatistics

Biostatistics: A Methodology for the Health Sciences, Second Edition, by Gerald van Belle, Lloyd D. Fisher, Patrick J. Heagerty, and Thomas S. Lumley
ISBN 0-471-03185-2 Copyright 2004 John Wiley & Sons, Inc.

1.1 INTRODUCTION

We welcome the reader who wishes to learn biostatistics. In this chapter we introduce you to the subject. We define statistics and biostatistics. Then examples are given where biostatistical techniques are useful. These examples show that biostatistics is an important tool in advancing our biological knowledge; biostatistics helps evaluate many life-and-death issues in medicine.

We urge you to read the examples carefully. Ask yourself, “what can be inferred from the information presented?” How would you design a study or experiment to investigate the problem at hand? What would you do with the data after they are collected? We want you to realize that biostatistics is a tool that can be used to benefit you and society.

The chapter closes with a description of what you may accomplish through use of this book. To paraphrase Pythagoras, there is no royal road to biostatistics. You need to be involved. You need to work hard. You need to think. You need to analyze actual data. The end result will be a tool that has immediate practical uses. As you thoughtfully consider the material presented here, you will develop thought patterns that are useful in evaluating information in all areas of your life.

1.2 WHAT IS THE FIELD OF STATISTICS?

Much of the joy and grief in life arises in situations that involve considerable uncertainty. Here are a few such situations:

1. Parents of a child with a genetic defect consider whether or not they should have another child. They will base their decision on the chance that the next child will have the same defect.
2. To choose the best therapy, a physician must compare the prognosis, or future course, of a patient under several therapies. A therapy may be a success, a failure, or somewhere in between; the evaluation of the chance of each occurrence necessarily enters into the decision.
3. In an experiment to investigate whether a food additive is carcinogenic (i.e., causes or at least enhances the possibility of having cancer), the U.S. Food and Drug Administration has animals treated with and without the additive. Often, cancer will develop in both the treated and untreated groups of animals. In both groups there will be animals that do not develop cancer. There is a need for some method of determining whether the group treated with the additive has “too much” cancer.
4. It is well known that “smoking causes cancer.” Smoking does not cause cancer in the same manner that striking a billiard ball with another causes the second billiard ball to move. Many people smoke heavily for long periods of time and do not develop cancer. The formation of cancer subsequent to smoking is not an invariable consequence but occurs only a fraction of the time. Data collected to examine the association between smoking and cancer must be analyzed with recognition of an uncertain and variable outcome.
5. In designing and planning medical care facilities, planners take into account differing needs for medical care. Needs change because there are new modes of therapy, as well as demographic shifts, that may increase or decrease the need for facilities. All of the uncertainty associated with the future health of a population and its future geographic and demographic patterns should be taken into account.

Inherent in all of these examples is the idea of uncertainty. Similar situations do not always result in the same outcome. Statistics deals with this variability. This somewhat vague formulation will become clearer in this book. Many definitions of statistics explicitly bring in the idea of variability. Some definitions of statistics are given in the Notes at the end of the chapter.

1.3 WHY BIOSTATISTICS?

Biostatistics is the study of statistics as applied to biological areas. Biological laboratory experiments, medical research (including clinical research), and health services research all use statistical methods. Many other biological disciplines rely on statistical methodology.

Why should one study biostatistics rather than statistics, since the methods have wide applicability? There are three reasons for focusing on biostatistics:

1. Some statistical methods are used more heavily in biostatistics than in other fields. For example, a general statistical textbook would not discuss the life-table method of analyzing survival data (of importance in many biostatistical applications). The topics in this book are tailored to the applications in mind.
2. Examples are drawn from the biological, medical, and health care areas; this helps you maintain motivation. It also helps you understand how to apply statistical methods.
3. A third reason for a biostatistical text is to teach the material to an audience of health professionals. In this case, the interaction between students and teacher, but especially among the students themselves, is of great value in learning and applying the subject matter.

1.4 GOALS OF THIS BOOK

Suppose that we wanted to learn something about drugs; we can think of four different levels of knowledge. At the first level, a person may merely know that drugs act chemically when introduced into the body and produce many different effects. A second, higher level of knowledge is to know that a specific drug is given in certain situations, but we have no idea why the particular drug works. We do not know whether a drug might be useful in a situation that we have not yet seen. At the next, third level, we have a good idea why things work and also know how to administer drugs. At this level we do not have complete knowledge of all the biochemical principles involved, but we do have considerable knowledge about the activity and workings of the drug.

Finally, at the fourth and highest level, we have detailed knowledge of all of the interactions of the drug; we know the current research. This level is appropriate for researchers: those seeking to develop new drugs and to understand further the mechanisms of existing drugs. Think of the field of biostatistics in analogy to the drug field discussed above. It is our goal that those who complete the material in this book should be on the third level. This book is written to enable you to do more than apply statistical techniques mindlessly.

The greatest danger is in statistical analysis untouched by the human mind. We have the following objectives:

1. You should understand specified statistical concepts and procedures.
2. You should be able to identify procedures appropriate (and inappropriate) to a given situation. You should also have the knowledge to recognize when you do not know of an appropriate technique.
3. You should be able to carry out appropriate specified statistical procedures.

These are high goals for you, the reader of the book. But experience has shown that professionals in a wide variety of biological and medical areas can and do attain this level of expertise. The material presented in the book is often difficult and challenging; time and effort will, however, result in the acquisition of a valuable and indispensable tool that is useful in our daily lives as well as in scientific work.

1.5 STATISTICAL PROBLEMS IN BIOMEDICAL RESEARCH

We conclude this chapter with several examples of situations in which biostatistical design and analysis have been or could have been of use. The examples are placed here to introduce you to the subject, to provide motivation for you if you have not thought about such matters before, and to encourage thought about the need for methods of approaching variability and uncertainty in data.

The examples below deal with clinical medicine, an area that has general interest. Other examples can be found in Tanur et al. [1989].

1.5.1 Example 1: Treatment of King Charles II

This first example deals with the treatment of King Charles II during his terminal illness.
The following quote is taken from Haggard [1929]:

Some idea of the nature and number of the drug substances used in the medicine of the past may be obtained from the records of the treatment given King Charles II at the time of his death. These records are extant in the writings of a Dr. Scarburgh, one of the twelve or fourteen physicians called in to treat the king. At eight o’clock on Monday morning of February 2, 1685, King Charles was being shaved in his bedroom. With a sudden cry he fell backward and had a violent convulsion. He became unconscious, rallied once or twice, and after a few days died. Seventeenth-century autopsy records are far from complete, but one could hazard a guess that the king suffered with an embolism (that is, a floating blood clot which has plugged up an artery and deprived some portion of his brain of blood) or else his kidneys were diseased. As the first step in treatment the king was bled to the extent of a pint from a vein in his right arm. Next his shoulder was cut into and the incised area “cupped” to suck out an additional eight ounces of blood. After this homicidal onslaught the drugging began. An emetic and purgative were administered, and soon after a second purgative. This was followed by an enema containing antimony, sacred bitters, rock salt, mallow leaves, violets, beet root, camomile flowers, fennel seeds, linseed, cinnamon, cardamom seed, saphron, cochineal, and aloes. The enema was repeated in two hours and a purgative given. The king’s head was shaved and a blister raised on his scalp. A sneezing powder of hellebore root was administered, and also a powder of cowslip flowers “to strengthen his brain.” The cathartics were repeated at frequent intervals and interspersed with a soothing drink composed of barley water, licorice and sweet almond. Likewise white wine, absinthe and anise were given, as also were extracts of thistle leaves, mint, rue, and angelica. For external treatment a plaster of Burgundy pitch and pigeon dung was applied to the king’s feet. The bleeding and purging continued, and to the medicaments were added melon seeds, manna, slippery elm, black cherry water, an extract of flowers of lime, lily-of-the-valley, peony, lavender, and dissolved pearls. Later came gentian root, nutmeg, quinine, and cloves. The king’s condition did not improve, indeed it grew worse, and in the emergency forty drops of extract of human skull were administered to allay convulsions. A rallying dose of Raleigh’s antidote was forced down the king’s throat; this antidote contained an enormous number of herbs and animal extracts. Finally bezoar stone was given. Then says Scarburgh: “Alas! after an ill-fated night his serene majesty’s strength seemed exhausted to such a degree that the whole assembly of physicians lost all hope and became despondent: still so as not to appear to fail in doing their duty in any detail, they brought into play the most active cordial.” As a sort of grand summary to this pharmaceutical debauch a mixture of Raleigh’s antidote, pearl julep, and ammonia was forced down the throat of the dying king.

From this time and distance there are comical aspects about this observational study describing the “treatment” given to King Charles. It should be remembered that his physicians were doing their best according to the state of their knowledge. Our knowledge has advanced considerably, but it would be intellectual pride to assume that all modes of medical treatment in use today are necessarily beneficial.
This example illustrates that there is a need for sound scientific development and verification in the biomedical sciences.

1.5.2 Example 2: Relationship between the Use of Oral Contraceptives and Thromboembolic Disease

In 1967 in Great Britain, there was concern about higher rates of thromboembolic disease (disease from blood clots) among women using oral contraceptives than among women not using oral contraceptives. To investigate the possibility of a relationship, Vessey and Doll [1969] studied existing cases with thromboembolic disease. Such a study is called a retrospective study because retrospectively, or after the fact, the cases were identified and data accumulated for analysis. The study began by identifying women aged 16 to 40 years who had been discharged from one of 19 hospitals with a diagnosis of deep vein thrombosis, pulmonary embolism, cerebral thrombosis, or coronary thrombosis.

The idea of the study was to interview the cases to see if more of them were using oral contraceptives than one would “expect.” The investigators needed to know how much oral contraceptive use to expect assuming that such use does not predispose people to thromboembolic disease. This is done by identifying a group of women “comparable” to the cases. The amount of oral contraceptive use in this control, or comparison, group is used as a standard of comparison for the cases. In this study, two control women were selected for each case: The control women had suffered an acute surgical or medical condition, or had been admitted for elective surgery. The controls had the same age, date of hospital admission, and parity (number of live births) as the cases. The controls were selected to have the absence of any predisposing cause of thromboembolic disease.

If there is no relationship between oral contraception and thromboembolic disease, the cases with thromboembolic disease would be no more likely than the controls to use oral contraceptives.
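
As a rough illustration of this comparison, the counts reported for the study (42 of 84 cases and 23 of 168 controls used oral contraceptives) can be turned into proportions, an odds ratio, and a simple chi-square statistic. This is only a sketch under stated assumptions: the variable and function names are hypothetical, the chi-square test is our choice and not necessarily the analysis Vessey and Doll performed, and treating the data as an unmatched 2 x 2 table ignores the fact that controls were matched to cases.

```python
from math import erfc, sqrt

# Counts reported in the text for Vessey and Doll [1969].
cases_total, cases_users = 84, 42
controls_total, controls_users = 168, 23

# Non-users follow by subtraction.
cases_nonusers = cases_total - cases_users
controls_nonusers = controls_total - controls_users

# Proportion of oral contraceptive users in each group.
p_cases = cases_users / cases_total
p_controls = controls_users / controls_total

# Odds ratio: odds of use among cases divided by odds of use among controls.
odds_ratio = (cases_users / cases_nonusers) / (controls_users / controls_nonusers)


def pearson_chi_square(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2 x 2 table [[a, b], [c, d]]. Assumption: unmatched analysis."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


chi2 = pearson_chi_square(cases_users, cases_nonusers,
                          controls_users, controls_nonusers)

# With 1 degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(f"proportion using OCs: cases {p_cases:.2f}, controls {p_controls:.2f}")
print(f"odds ratio: {odds_ratio:.2f}")
print(f"chi-square (1 df): {chi2:.2f}, p-value: {p_value:.2g}")
```

A very small p-value here corresponds to the informal statement that such a difference is "unlikely to occur by chance"; an analysis that respects the matching would be needed to reproduce the published conclusion properly.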
In this study, 42 of 84 cases, or 50%, used oral contraceptives. Twenty-three of the 168 controls, or 14%, used oral contraceptives. After deciding that such a difference is unlikely to occur by chance, the authors concluded that there is a relationship between oral contraceptive use and thromboembolic disease.

This study is an example of a case–control study. The aim of such a study is to examine potential risk factors (i.e., factors that may dispose a person to have the disease) for a disease. The study begins with the identification of cases with the disease specified. A control group is then selected. The control group is a group of subjects comparable to the cases except for the presence of the disease and the possible presence of the risk factor(s). The case and control groups are then examined to see if a risk factor occurs more often in the cases than in the controls than would be expected by chance.

1.5.3 Example 3: Use of Laboratory Tests and the Relation to Quality of Care

Laboratory tests are an important feature of medical care. These tests affect both the quality and the cost of care. The frequency with which such tests are ordered varies with the physician. It is not clear how the frequency of such tests influences the quality of medical care. Laboratory tests are sometimes ordered as part of “defensive” medical practice. Some of the variation is due to training. Studies investigating the relationship between use of tests and quality of care need to be designed carefully to measure the quantities of interest reliably, without bias. Given the expense of laboratory tests and limited time and resources, there clearly is a need for evaluation of the relationship between the use of laboratory tests and the quality of care.

The study discussed here, reported by Schroeder et al. [1974], consisted of 21 physicians serving medical internships. The interns were ranked independently on overall clinical capability (i.e., quality of care) by five faculty internists who had interacted with them during their medical training. Only patients admitted with uncomplicated acute myocardial infarction or uncomplicated chest pain were considered for the study. “Medical records of all patients hospitalized on the coronary care unit between July 1, 1971 and June 20, 1972, were analyzed and all patients meeting the eligibility criteria were included in the study. . . . ” The frequency of laboratory utilization ordered during the first three days of hospitalization was translated into cost. Since daily EKGs and enzyme determinations (SGOT, LDH, and CPK) were ordered on all patients, the costs of these tests were excluded.
Mean costs of laboratory use were calculated for each intern’s subset of patients, and the interns were ranked in order of increasing costs on a per-patient basis.

Rankings by the five faculty internists and by cost are given in Table 1.1. There is considerable variation in the evaluations of the five internists; for example, intern K is ranked seventeenth in clinical competence by internists I and III, but first by internist II. This table still does not clearly answer the question of whether there is a relationship between clinical competence and the frequency of use of laboratory tests and their cost. Figure 1.1 shows the relationship between cost and one measure of clinical competence; on the basis of this graph and some statistical calculations, the authors conclude that “at least in the setting measured, no overall correlation existed between cost of medical care and competence of medical care.”

This study contains good examples of the types of (basically statistical) problems facing a researcher in the health administration area. First, what is the population of interest? In other words, what population do the 21 interns represent? Second, there are difficult measurement problems: Is level of clinical competence, as evaluated by an internist, equivalent to the level of quality of care? How reliable are the internists? The variation in their assessments has already been noted. Is cost of laboratory use synonymous with cost of medical care as the authors seem to imply in their conclusion?

Table 1.1 Independent Assessment of Clinical Competence of 21 Medical Interns by Five Faculty Internists and Ranking of Cost of Laboratory Procedures Ordered, George Washington University Hospital, 1971–1972

[Table body garbled in extraction. Surviving column headings: Clinical Competence (a), Rank, Rank of Costs of Procedures Ordered (b).]

Source: Data from Schroeder et al. [1974]; by permission of Medical Care.
a: 1 = most competent.
b: 1 = least expensive.

1.5.4 Example 4: Internal Mammary Artery Ligation

One of the greatest health problems in the world, especially in industrialized nations, is coronary artery disease. The coronary arteries are the arteries around the outside of the heart. These arteries bring blood to the heart muscle (myocardium). Coronary artery disease brings a narrowing of the coronary arteries. Such narrowing often results in chest, neck, and arm pain (angina pectoris) precipitated by exertion. When arteries block off completely or occlude, a portion of the heart muscle is deprived of its blood supply, with life-giving oxygen and nutrients. A myocardial infarction, or heart attack, is the death of a portion of the heart muscle.

As the coronary arteries narrow, the body often compensates by building collateral circulation, circulation that involves branches from existing coronary arteries that develop to bring blood to an area of restricted blood flow. The internal mammary arteries are arteries that bring blood to the chest. The tributaries of the internal mammary arter
