LINEAR MODELS IN STATISTICS



Second Edition

Alvin C. Rencher and G. Bruce Schaalje
Department of Statistics, Brigham Young University, Provo, Utah

Copyright © 2008 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.

Library of Congress Cataloging-in-Publication Data:

Rencher, Alvin C., 1934–
    Linear models in statistics / Alvin C. Rencher, G. Bruce Schaalje. – 2nd ed.
    p. cm.
    Includes bibliographical references.
    ISBN 978-0-471-75498-5 (cloth)
    1. Linear models (Statistics) I. Schaalje, G. Bruce. II. Title.
    QA276.R425 2007
    519.5'035–dc22
    2007024268

Printed in the United States of America.

CONTENTS

Preface

1 Introduction
    1.1 Simple Linear Regression Model
    1.2 Multiple Linear Regression Model
    1.3 Analysis-of-Variance Models

2 Matrix Algebra
    2.1 Matrix and Vector Notation
        2.1.1 Matrices, Vectors, and Scalars
        2.1.2 Matrix Equality
        2.1.3 Transpose
        2.1.4 Matrices of Special Form
    2.2 Operations
        2.2.1 Sum of Two Matrices or Two Vectors
        2.2.2 Product of a Scalar and a Matrix
        2.2.3 Product of Two Matrices or Two Vectors
        2.2.4 Hadamard Product of Two Matrices or Two Vectors
    2.3 Partitioned Matrices
    2.4 Rank
    2.5 Inverse
    2.6 Positive Definite Matrices
    2.7 Systems of Equations
    2.8 Generalized Inverse
        2.8.1 Definition and Properties
        2.8.2 Generalized Inverses and Systems of Equations
    2.9 Determinants
    2.10 Orthogonal Vectors and Matrices
    2.11 Trace
    2.12 Eigenvalues and Eigenvectors
        2.12.1 Definition
        2.12.2 Functions of a Matrix
        2.12.3 Products
        2.12.4 Symmetric Matrices
        2.12.5 Positive Definite and Semidefinite Matrices
    2.13 Idempotent Matrices
    2.14 Vector and Matrix Calculus
        2.14.1 Derivatives of Functions of Vectors and Matrices
        2.14.2 Derivatives Involving Inverse Matrices and Determinants
        2.14.3 Maximization or Minimization of a Function of a Vector

3 Random Vectors and Matrices
    3.1 Introduction
    3.2 Means, Variances, Covariances, and Correlations
    3.3 Mean Vectors and Covariance Matrices for Random Vectors
        3.3.1 Mean Vectors
        3.3.2 Covariance Matrix
        3.3.3 Generalized Variance
        3.3.4 Standardized Distance
    3.4 Correlation Matrices
    3.5 Mean Vectors and Covariance Matrices for Partitioned Random Vectors
    3.6 Linear Functions of Random Vectors
        3.6.1 Means
        3.6.2 Variances and Covariances

4 Multivariate Normal Distribution
    4.1 Univariate Normal Density Function
    4.2 Multivariate Normal Density Function
    4.3 Moment Generating Functions
    4.4 Properties of the Multivariate Normal Distribution
    4.5 Partial Correlation

5 Distribution of Quadratic Forms in y
    5.1 Sums of Squares
    5.2 Mean and Variance of Quadratic Forms
    5.3 Noncentral Chi-Square Distribution
    5.4 Noncentral F and t Distributions
        5.4.1 Noncentral F Distribution
        5.4.2 Noncentral t Distribution
    5.5 Distribution of Quadratic Forms
    5.6 Independence of Linear Forms and Quadratic Forms

6 Simple Linear Regression
    6.1 The Model
    6.2 Estimation of β₀, β₁, and σ²
    6.3 Hypothesis Test and Confidence Interval for β₁
    6.4 Coefficient of Determination

7 Multiple Regression: Estimation
    7.1 Introduction
    7.2 The Model
    7.3 Estimation of β and σ²
        7.3.1 Least-Squares Estimator for β
        7.3.2 Properties of the Least-Squares Estimator β̂
        7.3.3 An Estimator for σ²
    7.4 Geometry of Least-Squares
        7.4.1 Parameter Space, Data Space, and Prediction Space
        7.4.2 Geometric Interpretation of the Multiple Linear Regression Model
    7.5 The Model in Centered Form
    7.6 Normal Model
        7.6.1 Assumptions
        7.6.2 Maximum Likelihood Estimators for β and σ²
        7.6.3 Properties of β̂ and σ̂²
    7.7 R² in Fixed-x Regression
    7.8 Generalized Least-Squares: cov(y) = σ²V
        7.8.1 Estimation of β and σ² when cov(y) = σ²V
        7.8.2 Misspecification of the Error Structure
    7.9 Model Misspecification
    7.10 Orthogonalization

8 Multiple Regression: Tests of Hypotheses and Confidence Intervals
    8.1 Test of Overall Regression
    8.2 Test on a Subset of the β Values
    8.3 F Test in Terms of R²
    8.4 The General Linear Hypothesis Tests for H₀: Cβ = 0 and H₀: Cβ = t
        8.4.1 The Test for H₀: Cβ = 0
        8.4.2 The Test for H₀: Cβ = t
    8.5 Tests on βⱼ and a′β
        8.5.1 Testing One βⱼ or One a′β
        8.5.2 Testing Several βⱼ or aᵢ′β Values
    8.6 Confidence Intervals and Prediction Intervals
        8.6.1 Confidence Region for β
        8.6.2 Confidence Interval for βⱼ
        8.6.3 Confidence Interval for a′β
        8.6.4 Confidence Interval for E(y)
        8.6.5 Prediction Interval for a Future Observation
        8.6.6 Confidence Interval for σ²
        8.6.7 Simultaneous Intervals
    8.7 Likelihood Ratio Tests

9 Multiple Regression: Model Validation and Diagnostics
    9.1 Residuals
    9.2 The Hat Matrix
    9.3 Outliers
    9.4 Influential Observations and Leverage

10 Multiple Regression: Random x's
    10.1 Multivariate Normal Regression Model
    10.2 Estimation and Testing in Multivariate Normal Regression
    10.3 Standardized Regression Coefficients
    10.4 R² in Multivariate Normal Regression
    10.5 Tests and Confidence Intervals for R²
    10.6 Effect of Each Variable on R²
    10.7 Prediction for Multivariate Normal or Nonnormal Data
    10.8 Sample Partial Correlations

11 Multiple Regression: Bayesian Inference
    11.1 Elements of Bayesian Statistical Inference
    11.2 A Bayesian Multiple Linear Regression Model
        11.2.1 A Bayesian Multiple Regression Model with a Conjugate Prior
        11.2.2 Marginal Posterior Density of β
        11.2.3 Marginal Posterior Densities of τ and σ²
    11.3 Inference in Bayesian Multiple Linear Regression
        11.3.1 Bayesian Point and Interval Estimates of Regression Coefficients
        11.3.2 Hypothesis Tests for Regression Coefficients in Bayesian Inference
        11.3.3 Special Cases of Inference in Bayesian Multiple Regression Models
        11.3.4 Bayesian Point and Interval Estimation of σ²
    11.4 Bayesian Inference through Markov Chain Monte Carlo Simulation
    11.5 Posterior Predictive Inference

12 Analysis-of-Variance Models
    12.1 Non-Full-Rank Models
        12.1.1 One-Way Model
        12.1.2 Two-Way Model
    12.2 Estimation
        12.2.1 Estimation of β
        12.2.2 Estimable Functions of β
    12.3 Estimators
        12.3.1 Estimators of λ′β
        12.3.2 Estimation of σ²
        12.3.3 Normal Model
    12.4 Geometry of Least-Squares in the Overparameterized Model
    12.5 Reparameterization
    12.6 Side Conditions
    12.7 Testing Hypotheses
        12.7.1 Testable Hypotheses
        12.7.2 Full-Reduced-Model Approach
        12.7.3 General Linear Hypothesis
    12.8 An Illustration of Estimation and Testing
        12.8.1 Estimable Functions
        12.8.2 Testing a Hypothesis
        12.8.3 Orthogonality of Columns of X

13 One-Way Analysis-of-Variance: Balanced Case
    13.1 The One-Way Model
    13.2 Estimable Functions
    13.3 Estimation of Parameters
        13.3.1 Solving the Normal Equations
        13.3.2 An Estimator for σ²
    13.4 Testing the Hypothesis H₀: μ₁ = μ₂ = ... = μₖ
        13.4.1 Full-Reduced-Model Approach
        13.4.2 General Linear Hypothesis
    13.5 Expected Mean Squares
        13.5.1 Full-Reduced-Model Approach
        13.5.2 General Linear Hypothesis
    13.6 Contrasts
        13.6.1 Hypothesis Test for a Contrast
        13.6.2 Orthogonal Contrasts
        13.6.3 Orthogonal Polynomial Contrasts

14 Two-Way Analysis-of-Variance: Balanced Case
    14.1 The Two-Way Model
    14.2 Estimable Functions
    14.3 Estimators of λ′β and σ²
        14.3.1 Solving the Normal Equations and Estimating λ′β
        14.3.2 An Estimator for σ²
    14.4 Testing Hypotheses
        14.4.1 Test for Interaction
        14.4.2 Tests for Main Effects
    14.5 Expected Mean Squares
        14.5.1 Sums-of-Squares Approach
        14.5.2 Quadratic Form Approach

15 Analysis-of-Variance: The Cell Means Model for Unbalanced Data
    15.1 Introduction
    15.2 One-Way Model
        15.2.1 Estimation and Testing
        15.2.2 Contrasts
    15.3 Two-Way Model
        15.3.1 Unconstrained Model
        15.3.2 Constrained Model
    15.4 Two-Way Model with Empty Cells

16 Analysis-of-Covariance
    16.1 Introduction
    16.2 Estimation and Testing
        16.2.1 The Analysis-of-Covariance Model
        16.2.2 Estimation
        16.2.3 Testing Hypotheses
    16.3 One-Way Model with One Covariate
        16.3.1 The Model
        16.3.2 Estimation
        16.3.3 Testing Hypotheses
    16.4 Two-Way Model with One Covariate
        16.4.1 Tests for Main Effects and Interactions
        16.4.2 Test for Slope
        16.4.3 Test for Homogeneity of Slopes
    16.5 One-Way Model with Multiple Covariates
        16.5.1 The Model
        16.5.2 Estimation
        16.5.3 Testing Hypotheses
    16.6 Analysis-of-Covariance with Unbalanced Models

17 Linear Mixed Models
    17.1 Introduction
    17.2 The Linear Mixed Model
    17.3 Examples
    17.4 Estimation of Variance Components
    17.5 Inference for β
        17.5.1 An Estimator for β
        17.5.2 Large-Sample Inference for Estimable Functions of β
        17.5.3 Small-Sample Inference for Estimable Functions of β
    17.6 Inference for the aᵢ Terms
    17.7 Residual Diagnostics

18 Additional Models
    18.1 Nonlinear Regression
    18.2 Logistic Regression
    18.3 Loglinear Models
    18.4 Poisson Regression
    18.5 Generalized Linear Models

Appendix A Answers and Hints to the Problems

References

Index

PREFACE

In the second edition, we have added chapters on Bayesian inference in linear models (Chapter 11) and linear mixed models (Chapter 17), and have upgraded the material in all other chapters. Our continuing objective has been to introduce the theory of linear models in a clear but rigorous format.

In spite of the availability of highly innovative tools in statistics, the main tool of the applied statistician remains the linear model. The linear model involves the simplest and seemingly most restrictive statistical properties: independence, normality, constancy of variance, and linearity. However, the model and the statistical methods associated with it are surprisingly versatile and robust. More importantly, mastery of the linear model is a prerequisite to work with advanced statistical tools because most advanced tools are generalizations of the linear model. The linear model is thus central to the training of any statistician, applied or theoretical.

This book develops the basic theory of linear models for regression, analysis-of-variance, analysis-of-covariance, and linear mixed models. Chapter 18 briefly introduces logistic regression, generalized linear models, and nonlinear models. Applications are illustrated by examples and problems using real data. This combination of theory and applications will prepare the reader to further explore the literature and to more correctly interpret the output from a linear models computer package.

This introductory linear models book is designed primarily for a one-semester course for advanced undergraduates or MS students. It includes more material than can be covered in one semester, so as to give an instructor a choice of topics and to serve as a reference book for researchers who wish to gain a better understanding of regression and analysis-of-variance. The book would also serve well as a text for PhD classes in which the instructor is looking for a one-semester introduction, and it would be a good supplementary text or reference for a more advanced PhD class for which the students need to review the basics on their own.

Our overriding objective in the preparation of this book has been clarity of exposition. We hope that students, instructors, researchers, and practitioners will find this linear models text more comfortable than most. In the final stages of development, we asked students for written comments as they read each day's assignment. They made many suggestions that led to improvements in readability of the book. We are grateful to readers who have notified us of errors and other suggestions for improvements of the text, and we will continue to be very grateful to readers who take the time to do so for this second edition.

Another objective of the book is to tie up loose ends. There are many approaches to teaching regression, for example. Some books present estimation of regression coefficients for fixed x's only, other books use random x's, some use centered models, and others define estimated regression coefficients in terms of variances and covariances or in terms of correlations. Theory for linear models has been presented using both an algebraic and a geometric approach. Many books present classical (frequentist) inference for linear models, while increasingly the Bayesian approach is presented. We have tried to cover all these approaches carefully and to show how they relate to each other. We have attempted to do something similar for various approaches to analysis-of-variance. We believe that this will make the book useful as a reference as well as a textbook. An instructor can choose the approach he or she prefers, and a student or researcher has access to other methods as well.

The book includes a large number of theoretical problems and a smaller number of applied problems using real datasets. The problems, along with the extensive set of answers in Appendix A, extend the book in two significant ways: (1) the theoretical problems and answers fill in nearly all gaps in derivations and proofs and also extend the coverage of material in the text, and (2) the applied problems and answers become additional examples illustrating the theory. As instructors, we find that having answers available for the students saves a great deal of class time and enables us to cover more material and cover it better. The answers would be especially useful to a reader who is engaging this material outside the formal classroom setting.

The mathematical prerequisites for this book are multivariable calculus and matrix algebra. The review of matrix algebra in Chapter 2 is intended to be sufficiently complete so that the reader with no previous experience can master matrix manipulation up to the level required in this book. Statistical prerequisites include some exposure to statistical theory, with coverage of topics such as distributions of random variables, expected values, moment generating functions, and an introduction to estimation and testing hypotheses. These topics are briefly reviewed as each is introduced. One or two statistical methods courses would also be helpful, with coverage of topics such as t tests, regression, and analysis-of-variance.

We have made considerable effort to maintain consistency of notation throughout the book. We have also attempted to employ standard notation as far as possible and to avoid exotic characters that cannot be readily reproduced on the chalkboard. With a few exceptions, we have refrained from the use of abbreviations and mnemonic devices. We often find these annoying in a book or journal article.

Equations are numbered sequentially throughout each chapter; for example, (3.29) indicates the twenty-ninth numbered equation in Chapter 3. Tables and figures are also numbered sequentially throughout each chapter in the form "Table 7.4" or "Figure 3.2." On the other hand, examples and theorems are numbered sequentially within a section, for example, Theorems 2.2a and 2.2b.

The solution of most of the problems with real datasets requires the use of the computer. We have not discussed command files or output of any particular program, because there are so many good packages available. Computations for the numerical examples and numerical problems were done with SAS.

The datasets and SAS command files for all the numerical examples and problems in the text are available on the Internet; see Appendix B.

The references list is not intended to be an exhaustive survey of the literature. We have provided original references for some of the basic results in linear models and have also referred the reader to many up-to-date texts and reference books useful for further reading. When citing references in the text, we have used the standard format involving the year of publication. For journal articles, the year alone suffices, for example, Fisher (1921). But for a specific reference in a book, we have included a page number or section, as in Hocking (1996, p. 216).

Our selection of topics is intended to prepare the reader for a better understanding of applications and for further reading in topics such as mixed models, generalized linear models, and Bayesian models. Following a brief introduction in Chapter 1, Chapter 2 contains a careful review of all aspects of matrix algebra needed to read the book. Chapters 3, 4, and 5 cover properties of random vectors, matrices, and quadratic forms. Chapters 6, 7, and 8 cover simple and multiple linear regression, including estimation and testing hypotheses and consequences of misspecification of the model. Chapter 9 provides diagnostics for model validation and detection of influential observations. Chapter 10 treats multiple regression with random x's. Chapter 11 covers Bayesian multiple linear regression models along with Bayesian inferences based on those models. Chapter 12 covers the basic theory of analysis-of-variance models, including estimability and testability for the overparameterized model, reparameterization, and the imposition of side conditions. Chapters 13 and 14 cover balanced one-way and two-way analysis-of-variance models using an overparameterized model. Chapter 15 covers unbalanced analysis-of-variance models using a cell means model, including a section on dealing with empty cells in two-way analysis-of-variance. Chapter 16 covers analysis-of-covariance models. Chapter 17 covers the basic theory of linear mixed models, including residual maximum likelihood estimation of variance components, approximate small-sample inferences for fixed effects, best linear unbiased prediction of random effects, and residual analysis. Chapter 18 introduces additional topics such as nonlinear regression, logistic regression, loglinear models, Poisson regression, and generalized linear models.

In our class for first-year master's-level students, we cover most of the material in Chapters 2–5, 7–8, 10–12, and 17. Many other sequences are possible. For example, a thorough one-semester regression and analysis-of-variance course could cover Chapters 1–10 and 12–15.

Al's introduction to linear models came in classes taught by Dale Richards and Rolf Bargmann. He also learned much from the books by Graybill, Scheffé, and Rao. Al expresses thanks to the following for reading the first edition manuscript and making many valuable suggestions: David Turner, John Walker, Joel Reynolds, and Gale Rex Bryce. Al thanks the following students at Brigham Young University (BYU) who helped with computations, graphics, and typing of the first edition: David Fillmore, Candace Baker, Scott Curtis, Douglas Burton, David Dahl, Brenda Price, Eric Hintze, James Liechty, and Joy Willbur.

The students in Al's Linear Models class went through the manuscript carefully and spotted many typographical errors and passages that needed additional clarification.

Bruce's education in linear models came in classes taught by Mel Carter, Del Scott, Doug Martin, Peter Bloomfield, and Francis Giesbrecht, and influential short courses taught by John Nelder and Russ Wolfinger.

We thank Bruce's Linear Models classes of 2006 and 2007 for going through the book and new chapters. They made valuable suggestions for improvement of the text. We thank Paul Martin and James Hattaway for invaluable help with LaTeX. The Department of Statistics, Brigham Young University, provided financial support and encouragement throughout the project.

Second Edition

For the second edition we added Chapter 11 on Bayesian inference in linear models (including Gibbs sampling) and Chapter 17 on linear mixed models.

We also added a section in Chapter 2 on vector and matrix calculus, adding several new theorems and covering the Lagrange multiplier method. In Chapter 4, we presented a new proof of the conditional distribution of a subvector of a multivariate normal vector. In Chapter 5, we provided proofs of the moment generating function and variance of a quadratic form of a multivariate normal vector. The section on the geometry of least squares was completely rewritten in Chapter 7, and a section on the geometry of least squares in the overparameterized linear model was added to Chapter 12. Chapter 8 was revised to provide more motivation for hypothesis testing and simultaneous inference. A new section was added to Chapter 15 dealing with two-way analysis-of-variance when there are empty cells. This material is not available in any other textbook that we are aware of.

This book would not have been possible without the patience, support, and encouragement of Al's wife LaRue and Bruce's wife Lois. Both have helped and supported us in more ways than they know. This book is dedicated to them.

ALVIN C. RENCHER
Department of Statistics
Brigham Young University
Provo, Utah

AND

G. BRUCE SCHAALJE

1 Introduction

The scientific method is frequently used as a guided approach to learning. Linear statistical methods are widely used as part of this learning process. In the biological, physical, and social sciences, as well as in business and engineering, linear models are useful in both the planning stages of research and analysis of the resulting data. In Sections 1.1–1.3, we give a brief introduction to simple and multiple linear regression models, and analysis-of-variance (ANOVA) models.

1.1 SIMPLE LINEAR REGRESSION MODEL

In simple linear regression, we attempt to model the relationship between two variables, for example, income and number of years of education, height and weight of people, length and width of envelopes, temperature and output of an industrial process, altitude and boiling point of water, or dose of a drug and response. For a linear relationship, we can use a model of the form

$$y = \beta_0 + \beta_1 x + \varepsilon, \qquad (1.1)$$

where y is the dependent or response variable and x is the independent or predictor variable. The random variable ε is the error term in the model. In this context, error does not mean mistake but is a statistical term representing random fluctuations, measurement errors, or the effect of factors outside of our control.

The linearity of the model in (1.1) is an assumption. We typically add other assumptions about the distribution of the error terms, independence of the observed values of y, and so on. Using observed values of x and y, we estimate β₀ and β₁ and make inferences such as confidence intervals and tests of hypotheses for β₀ and β₁. We may also use the estimated model to forecast or predict the value of y for a particular value of x, in which case a measure of predictive accuracy may also be of interest.

Estimation and inferential procedures for the simple linear regression model are developed and illustrated in Chapter 6.
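As a concrete illustration of model (1.1), the following minimal sketch computes the least-squares estimates of β₀ and β₁ and the usual unbiased estimate of σ²; the estimators themselves are derived in Chapter 6. The data are hypothetical, and Python/NumPy is used here purely for illustration (the book's own computations were done with SAS).

```python
import numpy as np

# Hypothetical data: dose of a drug (x) and response (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Least-squares estimates (Chapter 6):
#   b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),   b0 = ybar - b1 * xbar
xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Unbiased estimate of sigma^2 from the residuals (n - 2 degrees of freedom)
residuals = y - (b0 + b1 * x)
s2 = np.sum(residuals ** 2) / (len(y) - 2)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, s2 = {s2:.3f}")
```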

1.2 MULTIPLE LINEAR REGRESSION MODEL

The response y is often influenced by more than one predictor variable. For example, the yield of a crop may depend on the amount of nitrogen, potash, and phosphate fertilizers used. These variables are controlled by the experimenter, but the yield may also depend on uncontrollable variables such as those associated with weather. A linear model relating the response y to several predictors has the form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon. \qquad (1.2)$$

The parameters β₀, β₁, ..., βₖ are called regression coefficients. As in (1.1), ε provides for random variation in y not explained by the x variables. This random variation may be due partly to other variables that affect y but are not known or not observed.

The model in (1.2) is linear in the β parameters; it is not necessarily linear in the x variables. Thus models such as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_2 + \beta_4 \sin x_2 + \varepsilon$$

are included in the designation linear model (a numerical sketch of such a fit follows the list of purposes below).

A model provides a theoretical framework for better understanding of a phenomenon of interest. Thus a model is a mathematical construct that we believe may represent the mechanism that generated the observations at hand. The postulated model may be an idealized oversimplification of the complex real-world situation, but in many such cases, empirical models provide useful approximations of the relationships among variables. These relationships may be either associative or causative.

Regression models such as (1.2) are used for various purposes, including the following:

1. Prediction. Estimates of the individual parameters β₀, β₁, ..., βₖ are of less importance for prediction than the overall influence of the x variables on y. However, good estimates are needed to achieve good prediction performance.

2. Data Description or Explanation. The scientist or engineer uses the estimated model to summarize or describe the observed data.

3. Parameter Estimation. The values of the estimated parameters may have theoretical implications for a postulated model.

4. Variable Selection or Screening. The emphasis is on determining the importance of each predictor variable in modeling the variation in y. The predictors that are associated with an important amount of variation in y are retained; those that contribute little are deleted.

5. Control of Output. A cause-and-effect relationship between y and the x variables is assumed. The estimated model might then be used to control the output of a process by varying the inputs. By systematic experimentation, it may be possible to achieve the optimal output.
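The following sketch illustrates the point above that a model need only be linear in the β's: the design matrix carries nonlinear functions of the measured variables (x₁² and sin x₂), yet the fit is an ordinary linear least-squares fit. The data are simulated, and NumPy is used purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 2.0, size=30)
x2 = rng.uniform(0.0, 2.0 * np.pi, size=30)

# Simulate from y = 1 + 2*x1 - 0.5*x1^2 + 1.5*x2 + 0.8*sin(x2) + eps
y = (1.0 + 2.0 * x1 - 0.5 * x1**2 + 1.5 * x2 + 0.8 * np.sin(x2)
     + rng.normal(0.0, 0.1, size=30))

# Linear model: the columns of X are (possibly nonlinear) functions of x1 and x2,
# but the model remains linear in the betas
X = np.column_stack([np.ones_like(x1), x1, x1**2, x2, np.sin(x2)])
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(bhat)  # approximately (1.0, 2.0, -0.5, 1.5, 0.8)
```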

There is a fundamental difference between purposes 1 and 5. For prediction, we need only assume that the same correlations that prevailed when the data were collected also continue in place when the predictions are to be made. Showing that there is a significant relationship between y and the x variables in (1.2) does not necessarily prove that the relationship is causal. To establish causality in order to control output, the researcher must choose the values of the x variables in the model and use randomization to avoid the effects of other possible variables unaccounted for. In other words, to ascertain the effect of the x variables on y when the x variables are changed, it is necessary to change them.

Estimation and inferential procedures that contribute to the five purposes listed above are discussed in Chapters 7–11.

1.3 ANALYSIS-OF-VARIANCE MODELS

In analysis-of-variance (ANOVA) models, we are interested in comparing several populations or several conditions in a study. Analysis-of-variance models can be expressed as linear models with restrictions on the x values. Typically the x's are 0s or 1s. For example, suppose that a researcher wishes to compare the mean yield for four types of catalyst in an industrial process. If n observations are to be obtained for each catalyst, one model for the 4n observations can be expressed as

$$y_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, 2, 3, 4, \quad j = 1, 2, \ldots, n, \qquad (1.3)$$

where μᵢ is the mean corresponding to the ith catalyst. A hypothesis of interest is H₀: μ₁ = μ₂ = μ₃ = μ₄. The model in (1.3) can be expressed in the alternative form

$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, 2, 3, 4, \quad j = 1, 2, \ldots, n. \qquad (1.4)$$

In this form, αᵢ is the effect of the ith catalyst, and the hypothesis can be expressed as H₀: α₁ = α₂ = α₃ = α₄.

Suppose that the researcher also wishes to compare the effects of three levels of temperature and that n observations are taken at each of the 12 catalyst–temperature combinations. Then the model can be expressed as

$$y_{ijk} = \mu_{ij} + \varepsilon_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \qquad (1.5)$$
$$i = 1, 2, 3, 4; \quad j = 1, 2, 3; \quad k = 1, 2, \ldots, n,$$

where μᵢⱼ is the mean for the ijth catalyst–temperature combination, αᵢ is the effect of the ith catalyst, βⱼ is the effect of the jth level of temperature, and γᵢⱼ is the interaction or joint effect of the ith catalyst and jth level of temperature.
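To make concrete the statement above that ANOVA models are linear models whose x's are 0s and 1s, the sketch below builds the 0–1 design matrix for the cell means model (1.3) and computes the usual F statistic for H₀: μ₁ = μ₂ = μ₃ = μ₄. The data are simulated, NumPy is used purely for illustration, and the test itself is developed in Chapter 13.

```python
import numpy as np

# Four catalysts, n = 5 hypothetical observations per catalyst
n, k = 5, 4
catalyst = np.repeat(np.arange(k), n)              # group label for each observation
rng = np.random.default_rng(1)
y = (np.array([10.0, 10.0, 12.0, 11.0])[catalyst]  # true group means mu_i
     + rng.normal(0.0, 1.0, size=n * k))           # errors eps_ij

# Model (1.3) as a linear model: X has one 0-1 indicator column per catalyst
X = (catalyst[:, None] == np.arange(k)).astype(float)
mu_hat = np.linalg.lstsq(X, y, rcond=None)[0]      # least squares recovers the group means

# F statistic for H0: mu_1 = mu_2 = mu_3 = mu_4 (balanced one-way ANOVA)
grand_mean = y.mean()
ss_between = n * np.sum((mu_hat - grand_mean) ** 2)  # k - 1 degrees of freedom
ss_within = np.sum((y - X @ mu_hat) ** 2)            # k(n - 1) degrees of freedom
F = (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))
print(mu_hat, F)
```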

In the examples leading to models (1.3)–(1.5), the researcher chooses the type of catalyst or level of
