Analysis Of Variance For Regression/Multiple Regression


Analysis of Variance for Regression / Multiple Regression
Lecture Notes XVII
Statistics 112, Fall 2002

Announcements
The second midterm is next Thursday.
Extra office hours next week: Monday, 1:30-2:30; Wednesday, 9-10 and 1:30-2:30; or by appointment. Usual office hours on Tuesday, 1-2 and 4:30-5:30.
Haipeng will hold office hours tomorrow from 10:30-12:30.
A tutor from the university tutor service will hold a review session on Tuesday (Nov. 12) from 6-8 p.m. in Huntsman Hall, Room 250. He is not affiliated with the course.
I will post a set of exercises for this week's lectures by Friday night.

Outline
– Analysis of variance for regression.
– Multiple regression.
  – Basic model.
  – Estimating and interpreting the parameters.
  – The impact of lurking variables.
Reading for this time: Chapter 10.2 (analysis of variance for regression part), Chapter 11.

Analysis of Variance for Regression
The analysis of variance (ANOVA) provides a convenient method of comparing the fit of two or more models to the same set of data. Here we are interested in comparing
1. A simple linear regression model in which the slope is zero, μ_y = β_0, vs.
2. A simple linear regression model in which the slope is not zero, μ_y = β_0 + β_1 x.
For both models it is assumed that y_i = μ_{y_i} + ε_i, with the ε_i independent N(0, σ).
Analysis of variance summarizes information about the sources of variation in the data. Total variation in the response y is expressed by the deviations y_i − ȳ.

Two reasons why y_i does not equal ȳ:
– Responses y_i correspond to different values x_i of the explanatory variable. The fitted value ŷ_i estimates the mean response for each specific x_i. The difference ŷ_i − ȳ reflects variation in mean responses due to differences in x_i.
– Individual observations will vary about their mean because of variation within the subpopulation of responses to a fixed x. This variation is represented by the residuals y_i − ŷ_i.

Sums of Squares
Basic idea behind analysis of variance: If β_1 = 0, then all variation should be due to individual observations varying about their mean. We can estimate the amount of variation due to the responses corresponding to different values of the explanatory variable and base our test on this estimate.
Algebraic fact:
Σ(y_i − ȳ)² = Σ(ŷ_i − ȳ)² + Σ(y_i − ŷ_i)²   (1)
We write (1) as SST = SSM + SSE. SS stands for sum of squares, and T, M and E stand for total, model and error respectively. Total variation (SST) is the sum of variation due to the straight-line model for the regression function (SSM) and variation due to deviations from this model (SSE).
If H_0: β_1 = 0 were true, then SSM should be small.
Degrees of freedom are associated with each sum of squares.
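The decomposition (1) can be checked numerically. The sketch below is not part of the original notes; it fits a least-squares line to a small made-up data set and verifies that SST = SSM + SSE holds for the fitted values.

```python
def fit_simple_ls(x, y):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

# Made-up data set for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

b0, b1 = fit_simple_ls(x, y)
yhat = [b0 + b1 * xi for xi in x]
ybar = sum(y) / len(y)

SST = sum((yi - ybar) ** 2 for yi in y)               # total variation
SSM = sum((yh - ybar) ** 2 for yh in yhat)            # model variation
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # error variation

print(abs(SST - (SSM + SSE)) < 1e-9)  # True: the decomposition holds
```

The identity holds exactly (up to floating-point rounding) for any least-squares fit, not just this data set; it is what makes the ANOVA table additive.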

Degrees of freedom can be thought of as the number of independent pieces of information that the sum of squares reflects. DFT = n − 1, DFM = 1, DFE = n − 2.
Mean square (MS) = sum of squares / degrees of freedom.
Interpretation of r²: r² = SSM/SST is the fraction of variation in the values of y that is explained by the least squares regression of y on x.
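As a quick numerical sketch (with made-up sums of squares, not data from the notes), the degrees of freedom, mean squares and r² fit together as follows:

```python
# Hypothetical ANOVA quantities for n = 20 observations
n = 20
SST, SSM = 120.0, 90.0
SSE = SST - SSM                   # decomposition: SST = SSM + SSE

DFT, DFM, DFE = n - 1, 1, n - 2   # degrees of freedom for T, M, E
MSM = SSM / DFM                   # mean square = sum of squares / df
MSE = SSE / DFE
r2 = SSM / SST                    # fraction of variation explained

print(DFT == DFM + DFE)           # True: degrees of freedom also add up
print(r2)                         # 0.75
```

Note that the degrees of freedom decompose the same way the sums of squares do: n − 1 = 1 + (n − 2).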

The ANOVA F Test
H_0: β_1 = 0 (y is not linearly related to x) can be tested by comparing MSM with MSE. The ANOVA test statistic is
F = MSM / MSE.
F will tend to be small when H_0 is true and large when H_a is true. Under H_0, the F statistic has an F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator (Table E). For simple linear regression, the F test is equivalent to the t test of H_0: β_1 = 0 versus H_a: β_1 ≠ 0.
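The equivalence of the F test and the two-sided t test can be seen numerically: F = t². The following sketch (not from the notes, using a small made-up data set) computes both statistics for a simple linear regression and checks this.

```python
# Made-up data set for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SSM = sum((yh - ybar) ** 2 for yh in yhat)
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
MSM, MSE = SSM / 1, SSE / (n - 2)     # df: 1 for the model, n - 2 for error

F = MSM / MSE                         # ANOVA F statistic
t = b1 / (MSE / Sxx) ** 0.5           # t statistic for H0: beta_1 = 0

print(abs(F - t ** 2) < 1e-9)         # True: F = t^2 exactly
```

The identity F = t² is algebraic (SSM = b_1² Σ(x_i − x̄)²), so the two tests always give the same P-value in simple linear regression.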

Multiple Regression
Consider again the problem of deciding how many years you should stay in school. Suppose that you now have available the joint distribution of earnings (y), education (x_1) and IQ (x_2) for a sample from a population of people like yourself. You could just use the regression function E(y | x_1) to make your prediction, but given the extra information about IQ in the sample, it is natural to try to use it. The natural way to use the extra information is to use the multiple regression function E(y | x_1, x_2) to make your prediction (you substitute in your IQ to make the prediction).
Population regression function: μ_y = E(y | x_1, x_2).
Data for multiple regression:
Person 1: (x_11, x_12, y_1)
Person 2: (x_21, x_22, y_2)
  ⋮
Person n: (x_n1, x_n2, y_n)

Multiple Linear Regression Model
One possible model for the population regression function is the multiple linear regression model, an analogue of the simple linear regression model:
μ_y = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_p x_p
Interpretation of β_j: the change in the mean of y if x_j is increased by one unit and all other explanatory variables x_1, …, x_{j−1}, x_{j+1}, …, x_p are held fixed.
The multiple linear regression model is very flexible. For example, models with squared or interaction terms, such as μ_y = β_0 + β_1 x_1 + β_2 x_1² or μ_y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2, are still multiple linear regression models because they are linear in the β's.
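The interpretation of β_j can be made concrete with hypothetical coefficient values (these numbers are illustrative, not estimated from any data): raising x_1 by one unit while holding x_2 fixed shifts the mean response by exactly β_1.

```python
# Hypothetical coefficients for mu_y = b0 + b1*x1 + b2*x2
b0, b1, b2 = 10.0, 2.5, 0.75

def mu_y(x1, x2):
    """Mean response under the multiple linear regression model."""
    return b0 + b1 * x1 + b2 * x2

# One extra unit of x1 with x2 held fixed at 100
change = mu_y(13.0, 100.0) - mu_y(12.0, 100.0)
print(change)  # 2.5, i.e. exactly b1
```

The "held fixed" clause matters: if x_2 also moves when x_1 moves (as with a lurking variable), the observed change in the mean of y is no longer β_1 alone.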

Probability Model for Multiple Linear Regression
The statistical model for multiple linear regression is
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ⋯ + β_p x_ip + ε_i   for i = 1, …, n.
The mean response is a linear function of the explanatory variables. The residuals ε_i are independent and normally distributed with mean 0 and standard deviation σ. In other words, they are a simple random sample from a N(0, σ) distribution.
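The model can be simulated directly, which makes the role of each assumption visible. The sketch below uses made-up parameter values and variable ranges (loosely echoing the education/IQ example) and draws the errors as a simple random sample from N(0, σ).

```python
import random

random.seed(0)
# Hypothetical parameters, chosen only for illustration
beta0, beta1, beta2, sigma = 5.0, 2.0, 0.5, 1.0
n = 200

data = []
for _ in range(n):
    x1 = random.uniform(8, 16)       # e.g. years of education
    x2 = random.uniform(85, 130)     # e.g. IQ
    eps = random.gauss(0.0, sigma)   # independent N(0, sigma) error
    yi = beta0 + beta1 * x1 + beta2 * x2 + eps
    data.append((x1, x2, yi))

print(len(data))  # 200
```

Each simulated response is its mean β_0 + β_1 x_1 + β_2 x_2 plus an independent normal error, exactly the structure the model asserts for real data.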

Estimation of the Multiple Regression Parameters
Let b_0, b_1, …, b_p denote the estimators of β_0, β_1, …, β_p. For the i-th observation the predicted response is
ŷ_i = b_0 + b_1 x_i1 + ⋯ + b_p x_ip.
The i-th residual, the difference between the observed and predicted response, is
e_i = y_i − ŷ_i   (observed response − predicted response).
To estimate β_0, …, β_p, we use the method of least squares: choose the values of the b's that make the sum of squared residuals as small as possible, i.e., choose b_0, …, b_p to minimize
Σ (y_i − b_0 − b_1 x_i1 − ⋯ − b_p x_ip)².
The parameter σ² measures the variability of the responses about the population regression equation. We estimate σ² by
s² = Σ e_i² / (n − p − 1),
which is an average of the squared residuals. We estimate σ by s = √s². We call s the root mean squared error.
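The least-squares criterion can be carried out by solving the normal equations (X'X)b = X'y. This sketch is not from the notes: it uses a small made-up data set with p = 2 explanatory variables, a tiny Gaussian-elimination solver, and then computes s, the root mean squared error.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Made-up data: rows are (x1, x2, y); y happens to equal 1 + 2*x1 + 3*x2 exactly
rows = [(1.0, 2.0, 9.0), (2.0, 1.0, 8.0), (3.0, 4.0, 19.0),
        (4.0, 2.0, 15.0), (5.0, 5.0, 26.0)]

X = [[1.0, x1, x2] for x1, x2, _ in rows]          # design matrix columns: 1, x1, x2
y = [yi for _, _, yi in rows]
XtX = [[sum(X[i][r] * X[i][c] for i in range(len(X))) for c in range(3)]
       for r in range(3)]
Xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(3)]
b = solve(XtX, Xty)                                # least-squares b0, b1, b2

resid = [yi - (b[0] + b[1] * x1 + b[2] * x2) for x1, x2, yi in rows]
n, p = len(rows), 2
s2 = sum(e * e for e in resid) / (n - p - 1)       # estimate of sigma^2
s = s2 ** 0.5                                      # root mean squared error

print([round(v, 6) for v in b])  # [1.0, 2.0, 3.0] (data are exactly linear)
```

Because this toy data set lies exactly on a plane, the fit is perfect and s is essentially zero; with real data the residuals, and hence s, would be nonzero.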

The Impact of Lurking Variables
You want to predict what your earnings will be if you obtain a certain number of years of education. Let x_1 = years of education, x_2 = IQ and y = earnings. Suppose that you have a sample that only contains earnings and years of education data; people's IQs are not recorded.
IQ is probably a lurking variable, i.e., a variable that has an important effect on the relationship among the variables in a study but is not included among the variables studied.
Suppose that the population regression function for the expected value of earnings given education and IQ is a multiple linear regression function:
E(y | x_1, x_2) = β_0 + β_1 x_1 + β_2 x_2,
and also suppose that
E(x_2 | x_1) = α_0 + α_1 x_1.

What is E(y | x_1)? Substituting E(x_2 | x_1) = α_0 + α_1 x_1 into the multiple regression function gives
E(y | x_1) = β_0 + β_1 x_1 + β_2 E(x_2 | x_1) = (β_0 + β_2 α_0) + (β_1 + β_2 α_1) x_1.
If we use least squares to estimate the simple linear regression function of y on x_1, the slope of the least squares line will be an unbiased estimate of β_1 + β_2 α_1, which does not generally equal β_1. Thus, by regressing earnings on only years of education, you will not obtain the right slope for estimating the impact of additional years of education given your fixed IQ.
Two circumstances in which β_1 + β_2 α_1 = β_1:
– β_2 = 0, i.e., IQ does not help to predict earnings once education is included.
– α_1 = 0, i.e., years of education does not help to predict IQ.
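The omitted-variable slope β_1 + β_2 α_1 can be demonstrated numerically. In this sketch (hypothetical coefficients, not from the notes), IQ is constructed to lie exactly on its regression line on education, so the simple regression slope comes out to β_1 + β_2 α_1 exactly.

```python
# Hypothetical earnings model: y = beta0 + beta1*x1 + beta2*x2
beta0, beta1, beta2 = 1.0, 0.08, 0.01
# Hypothetical relation E(x2 | x1) = a0 + a1*x1
a0, a1 = 70.0, 3.0

x1 = [10.0, 12.0, 14.0, 16.0, 18.0]        # years of education
x2 = [a0 + a1 * v for v in x1]             # IQ exactly on its regression line
y = [beta0 + beta1 * u + beta2 * w for u, w in zip(x1, x2)]

# Simple least-squares regression of y on x1 alone (x2 omitted)
n = len(x1)
xbar, ybar = sum(x1) / n, sum(y) / n
slope = sum((u - xbar) * (yi - ybar) for u, yi in zip(x1, y)) / \
        sum((u - xbar) ** 2 for u in x1)

print(round(slope, 6))  # 0.11 = beta1 + beta2*a1, not beta1 = 0.08
```

Here β_2 α_1 = 0.03, so omitting IQ inflates the apparent return to a year of education from 0.08 to 0.11, illustrating why the simple-regression slope is biased for β_1.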

Effect of omitting ability on estimates of the returns to education (y = log earnings). The interpretation of the least squares coefficient is that it approximately measures the percent increase in earnings for one extra year of schooling.

Data Set                                        IQ omitted   IQ included
Male Ph.D.'s, 1958-1960                           0.0205       0.0213
Rejected low-AFQT military-trained
  applicants (1962)                               0.0346       0.0171
NLSYM (1969)                                      0.065        0.059
Veterans - CPS (1964)                             0.0508       0.0433
NLSYM (1973)                                      0.041        0.030
G-77                                              0.022        0.014
