SPSS Discriminant Function Analysis

Transcription

SPSS Discriminant FunctionAnalysisBy Hui BianOffice for Faculty Excellence

Discriminant Function Analysis What is Discriminant function analysis It builds a predictive model for groupmembership The model is composed of a discriminantfunction based on linear combinations ofpredictor variables. Those predictor variables provide the bestdiscrimination between groups.2

Discriminant Function Analysis Purpose of Discriminant analysis to maximally separate the groups. to determine the most parsimonious way toseparate groups to discard variables which are little relatedto group distinctions3

Discriminant Function Analysis Summary: we are interested in the relationshipbetween a group of independent variables and onecategorical variable. We would like to know howmany dimensions we would need to express thisrelationship. Using this relationship, we canpredict a classification based on the independentvariables or assess how well the independentvariables separate the categories in theclassification.4

Discriminant Function Analysis It is similar to regression analysis A discriminant score can be calculated based on theweighted combination of the independent variables Di a b1x1 b2x2 bnxn Di is predicted score (discriminant score) x is predictor and b is discriminant coefficient We use maximum likelihood technique to assign a caseto a group from a specified cut-off score. If group size is equal, the cut-off is mean score. If group size is not equal, the cut-off is calculated fromweighted means.5

Discriminant Function Analysis Grouping variables Categorical variables Can have more than two values The codes for the grouping variables must beintegers Independent variables Continuous Nominal variables must be recoded to dummyvariables6

Discriminant Function Analysis Discriminant function A latent variable of a linear combination ofindependent variables One discriminant function for 2-group discriminantanalysis For higher order discriminant analysis, the number ofdiscriminant function is equal to g-1 (g is the numberof categories of dependent/grouping variable). The first function maximizes the difference betweenthe values of the dependent variable. The second function maximizes the differencebetween the values of the dependent variable whilecontrolling the first function. And so on.7

Discriminant Function Analysis The first function will be the most powerfuldifferentiating dimension. The second and later functions may also representadditional significant dimensions of differentiation.8

Discriminant Function Analysis Assumptions (from SPSS 19.0 help) Cases should be independent. Predictor variables should have a multivariate normaldistribution, and within-group variance-covariancematrices should be equal across groups. Group membership is assumed to be mutuallyexclusive The procedure is most effective when groupmembership is a truly categorical variable; if groupmembership is based on values of a continuousvariable (for example, high IQ versus low IQ),consider using linear regression to take advantage ofthe richer information that is offered by thecontinuous variable itself.9

Discriminant Function Analysis Assumptions(similar to those for linear regression) Linearity, normality, multilinearity, equalvariances Predictor variables should have a multivariatenormal distribution. fairly robust to violations of the most of theseassumptions. But highly sensitive to outliers. Model specification10

Discriminant Function Analysis Test of significance For two groups, the null hypothesis is that themeans of the two groups on the discriminantfunction-the centroids, are equal. Centroids are the mean discriminant score foreach group. Wilk’s lambda is used to test for significantdifferences between groups. Wilk’s lambda is between 0 and 1. It tells us thevariance of dependent variable that is notexplained by the discriminant function.11

Discriminant Function Analysis Wilk’s lambda is also used to test for significantdifferences between the groups on the individualpredictor variables. It tells which variables contribute a significant amount ofprediction to help separate the groups.12

Discriminant Function Analysis Two groups using an example from SPSS manual Example: the purpose of this example is to identifycharacteristics that are indicative of people who arelikely to default on loans, and use those characteristicsto identify good and bad credit risks. Sample includes a total of 850 cases (old andnew/future customers) The first 700 cases arecustomers who were previously given loans. Use first 700 customers to create a discriminant analysismodel, setting the remaining 150 customers aside tovalidate the analysis. Then use the model to classify the 150 prospectivecustomers as good or bad credit risks.13

Discriminant Function Analysis Grouping variable: Default Predictors: employ, address, debtinc, andcreaddebt Obtain Discriminant function analysis Analyze Classify Discriminant14

Discriminant Function Analysis15

Discriminant Function Analysis16

Discriminant Function Analysis Click Classify to get this window17

Discriminant Function Analysis Click Save to get this window18

Discriminant Function Analysis SPSS Output: descriptive statistics19

Discriminant Function Analysis SPSS output: ANOVA tableIn the ANOVA table, the smaller the Wilks's lambda, the moreimportant the independent variable to the discriminant function.Wilks's lambda is significant by the F test for all independentvariables.20

Discriminant Function Analysis SPSS Output (correlation matrix)The within-groups correlation matrix shows thecorrelations between the predictors.21

Discriminant Function Analysis SPSS output: test of homogeneity ofcovariance matricesThe larger the log determinantin the table, the more thatgroup's covariance matrixdiffers. The "Rank" columnindicates the number ofindependent variables in thiscase. Since discriminantanalysis assumes homogeneityof covariance matrices betweengroups, we would like to seethe determinants be relativelyequal.22

Discriminant Function Analysis SPSS output: test of homogeneity ofcovariance matrices231. Box's M test tests the assumption ofhomogeneity of covariance matrices.This test is very sensitive to meetingthe assumption of multivariatenormality.2. Discriminant function analysis isrobust even when the homogeneity ofvariances assumption is not met,provided the data do not containimportant outliers.3. For our data, we conclude the groupsdo differ in their covariance matrices,violating an assumption of DA.4. when n is large, small deviationsfrom homogeneity will be foundsignificant, which is why Box's M mustbe interpreted in conjunction withinspection of the log determinants.

Discriminant Function Analysis SPSS output: test of homogeneity ofcovariance matrices241. The larger the eigenvalue, the moreof the variance in the dependentvariable is explained by thatfunction.2. Dependent has two categories, thereis only one discriminant function.3. The canonical correlation is themeasure of association between thediscriminant function and thedependent variable.4. The square of canonical correlationcoefficient is the percentage ofvariance explained in the dependentvariable.

Discriminant Function Analysis SPSS output: summary of canonical discriminantfunctionsWhen there are two groups, thecanonical correlation is the mostuseful measure in the table, and it isequivalent to Pearson's correlationbetween the discriminant scores andthe groups.Wilks' lambda is a measure of howwell each function separates casesinto groups. Smaller values of Wilks'lambda indicate greaterdiscriminatory ability of the function.25The associated chi-square statistic tests thehypothesis that the means of the functions listed areequal across groups. The small significance valueindicates that the discriminant function does betterthan chance at separating the groups.

Discriminant Function Analysis SPSS output: summary of canonical discriminantfunctionsThe standardized discriminant functioncoefficients in the table serve the samepurpose as beta weights in multipleregression (partial coefficient) : theyindicate the relative importance of theindependent variables in predicting thedependent. They allow you to comparevariables measured on different scales.Coefficients with large absolute valuescorrespond to variables with greaterdiscriminating ability.26

Discriminant Function Analysis SPSS output: summary of canonical discriminantfunctions1.The structure matrix table shows thecorrelations of each variable with eachdiscriminant function.2. Only one discriminant function is inthis study.3. The correlations then serve likefactor loadings in factor analysis -that is, by identifying the largestabsolute correlations associated witheach discriminant function theresearcher gains insight into how toname each function.27

Discriminant Function Analysis SPSS output: summary of canonical discriminantfunctions1. Discriminant function is a latentvariable that is created as alinear combination ofindependent variables.2. Discriminating variables areindependent variables.3. The table shows the Pearsoncorrelations between predictorsand standardized canonicaldiscriminant functions.4. Loading .30 may be removedfrom the model.28

Discriminant Function Analysis SPSS output: summary of canonicaldiscriminant functionsThis table contains theunstandardized discriminantfunction coefficients. Thesewould be used likeunstandardized b (regression)coefficients in multipleregression -- that is, they areused to construct the actualprediction equation which canbe used to classify new cases.29

Discriminant Function Analysis Discriminant function: our model should belike this:Di 0.058 – 0.12emplo – 0.037addres 0.075debin 0.312 credebt30

Discriminant Function Analysis SPSS output: summary of canonicaldiscriminant functions31Centroids are the mean discriminant scores for each group. This table is used toestablish the cutting point for classifying cases. If the two groups are of equalsize, the best cutting point is half way between the values of the functions atgroup centroids (that is, the average). If the groups are unequal, the optimalcutting point is the weighted average of the two values. The computer does theclassification automatically, so these values are for informational purposes.

Discriminant Function Analysis The centroids are calculated based on the function:Di 0.058 – 0.12emplo – 0.037addres 0.075debin 0.312 credebt Centroids are discriminant score for each groupwhen the variable means (rather than individualvalues for each case) are entered into the function.32

Discriminant Function Analysis SPSS Output: Classification Statistics33Prior Probabilities are used inclassification. The default isusing observed group sizes. Inyour sample to determine theprior probabilities ofmembership in the groupsformed by the dependent, andthis is necessary if you havedifferent group sizes. If eachgroup is of the same size, as analternative you could specifyequal prior probabilities for allgroups.

Discriminant Function Analysis SPSS Output: Classification StatisticsTwo sets (one for eachdependent group) ofunstandardized lineardiscriminant coefficients arecalculated, which can be usedto classify cases. This is theclassical method ofclassification, though now littleused.34

Discriminant Function Analysis SPSS Output: Classification StatisticsThis table is used to assess howwell the discriminant functionworks, and if it works equally wellfor each group of the dependentvariable. Here it correctlyclassifies more than 75% of thecases, making about the sameproportion of mistakes for bothcategories. Overall, 76.0% of thecases are correctly classified.35

Discriminant Function Analysis SPSS Output: separate-group plotsIf two distributionsoverlap too much, itmeans they do notdiscriminate too (poordiscriminant function).36

Discriminant Function Analysis Run DA37

Discriminant Function Analysis We get this classification table38Sensitivity 45.4%; specificity 94.2%

Discriminant Function Analysis The table from previous slide shows howaccurately the customers were classifiedinto these groups. Sensitivity: highly sensitive test means thatthere are few false negative results (Type IIerror) Specificity: highly specific test means thatthere few false positive results (Type Ierror).39

Discriminant Function Analysis Discriminant methods Enter all independent variables into theequation at once Stepwise: remove independent variables thatare not significant.40

Discriminant Function Analysis Discriminant Function Analysis (more than twoGroups) Example from SPSS mannual. A telecommunications provider hassegmented its customer base by serviceusage patterns, categorizing the customersinto four groups. If demographic data canbe used to predict group membership, youcan customize offers for individualprospective customers.41

Discriminant Function Analysis Variables in the analysis Dependent variable custcat (fourcategories) Independent variables: demographics Obtain discriminant function analysis Analyze Classify Discriminant42

Discriminant Function Analysis43When you have a lot of predictors, the stepwise methodcan be useful by automatically selecting the "best"variables to use in the model.

Discriminant Function Analysis Click Method44

Discriminant Function Analysis Method Wilks' lambda. A variable selection method forstepwise discriminant analysis that chooses variablesfor entry into the equation on the basis of how muchthey lower Wilks' lambda. At each step, the variablethat minimizes the overall Wilks' lambda is entered. Unexplained variance. At each step, the variable thatminimizes the sum of the unexplained variationbetween groups is entered. Mahalanobis distance. A measure of how much acase's values on the independent variables differfrom the average of all cases. A large Mahalanobisdistance identifies a case as having extreme valueson one or more of the independent variables.45

Discriminant Function Analysis Method Smallest F ratio. A method of variableselection in stepwise analysis based onmaximizing an F ratio computed from theMahalanobis distance between groups. Rao's V. A measure of the differencesbetween group means. Also called theLawley-Hotelling trace. At each step, thevariable that maximizes the increase inRao's V is entered. After selecting thisoption, enter the minimum value avariable must have to enter the analysis.46

Discriminant Function Analysis Use F value. A variable is entered into the model if its Fvalue is greater than the Entry value and is removed ifthe F value is less than the Removal value. Entry mustbe greater than Removal, and both values must bepositive. To enter more variables into the model, lowerthe Entry value. To remove more variables from themodel, increase the Removal value. Use probability of F. A variable is entered into the modelif the significance level of its F value is less than theEntry value and is removed if the significance level isgreater than the Removal value. Entry must be less thanRemoval, and both values must be positive. To entermore variables into the model, increase the Entry value.To remove more variables from the model, lower theRemoval value.47

Discriminant Function Analysis SPSS output The stepwise method starts with a modelthat doesn't include any of the predictors(step 0). At each step, the predictor with the largestF to Enter, value that exceeds the entrycriteria (by default, 3.84) is added to themodel.48

Discriminant Function Analysis SPSS output49

Discriminant Function Analysis SPSS output: variables in the analysis50Tolerance is the proportion of a variable's variance notaccounted for by other independent variables in the equation.A variable with very low tolerance contributes littleinformation to a model and can cause computationalproblems.Actually, tolerance is about multicollinearity. .40 is worthy ofconcern. .10 is problematic.

Discriminant Function Analysis SPSS output: variables in the analysisF to Remove values are useful for describing what happens if a variableis removed from the current model (given that the other variablesremain).51

Discriminant Function Analysis SPSS output: Summary of Canonical DiscriminantFunctionsNearly all of the variance explained by the model is due to the firsttwo discriminant functions. We can ignore the third function. For eachset of functions, this tests the hypothesis that the means of thefunctions listed are equal across groups. The test of function 3 has ap value of .34, so this function contributes little to the model.52

Discriminant Function Analysis SPSS outputThe structure matrix tableshows the correlations ofeach variable with eachdiscriminant function. Thecorrelations serve like factorloadings in factor analysis -that is, by identifying thelargest absolute correlationsassociated with eachdiscriminant function.53

Discriminant Function Analysis Three discriminant functionsFunction 1: -2.84 .88ed .01em .18reFunction 2: -1.86 .12ed .10em .17reFunction 3: -.82 - .22ed -.01em .66re54

Discriminant Function Analysis SPSS Output: combined-group plotThe closer the groupcentroids, the moreerrors of classificationlikely will be.55

Discriminant Function Analysis SPSS output: classification table56The model excels at identifying Total service customers. However, it doesan exceptionally poor job of classifying E-service customers. You may needto find another predictor in order to separate these customers.

Discriminant Function Analysis We have created a discriminant model thatclassifies customers into one of four predefined"service usage" groups, based on demographicinformation from each customer. Using thestructure matrix, we identified which variables aremost useful for segmenting the customer base.Lastly, the classification results show that themodel does poorly at classifying E-servicecustomers. If identifying E-service customers is notthe concern, the model may be accurate enoughfor this purpose.57

Discriminant Function Analysis58

Two groups using an example from SPSS manual Example: the purpose of this example is to identify characteristics that are indicative of people who are likely to default on loans, and use those characteristics to identify good and bad credit risks. Sample includes a total of 850 cas