The Many Flavors of Penalized Linear Discriminant Analysis

Transcription

The Many Flavors of Penalized Linear Discriminant Analysis
Daniela M. Witten, Assistant Professor of Biostatistics, University of Washington
May 9, 2011, Fourth Erich L. Lehmann Symposium, Rice University

Overview
- There has been a great deal of interest in the past 15 years in penalized regression,
  $\text{minimize}_{\beta} \{ \|y - X\beta\|^2 + P(\beta) \},$
  especially in the setting where the number of features $p$ exceeds the number of observations $n$.
- $P$ is a penalty function. It could be chosen to promote:
  - sparsity: e.g. the lasso, $P(\beta) = \lambda \|\beta\|_1$
  - smoothness
  - piecewise constancy.
- How can we extend the concepts developed for regression when $p > n$ to other problems?
- A case study: penalized linear discriminant analysis.
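As a concrete illustration of the penalized regression criterion above, here is a minimal sketch (not from the talk) that fits a lasso to a simulated $p > n$ problem using scikit-learn; the data and the penalty level `alpha` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                       # more features than observations
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                  # only the first 5 features matter
y = X @ beta_true + rng.standard_normal(n)

# The lasso solves min_beta ||y - X beta||^2 / (2n) + alpha * ||beta||_1,
# i.e. the criterion above with an l1 penalty promoting sparsity.
fit = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", int(np.sum(fit.coef_ != 0)))
```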

The classification problem
- The set-up:
  - We are given $n$ training observations $x_1, \ldots, x_n \in \mathbb{R}^p$, each of which falls into one of $K$ classes.
  - Let $y \in \{1, \ldots, K\}^n$ contain the class memberships of the training observations.
  - Let $X$ be the matrix with rows $x_1^T, \ldots, x_n^T$.
  - Each column of $X$ (feature) is centered to have mean zero.
- The goal:
  - We wish to develop a classifier based on the training observations $x_1, \ldots, x_n \in \mathbb{R}^p$ that we can use to classify a test observation $x \in \mathbb{R}^p$.
  - A classical approach: linear discriminant analysis.

Linear discriminant analysis

LDA via the normal model
- Fit a simple normal model to the data: $x_i \mid y_i = k \sim N(\mu_k, \Sigma_w)$.
- Apply Bayes' theorem to obtain a classifier: assign $x$ to the class for which $\delta_k(x)$ is largest, where
  $\delta_k(x) = x^T \Sigma_w^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma_w^{-1} \mu_k + \log \pi_k.$
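The rule above translates directly into code. Below is a minimal sketch (not from the talk) that estimates the class means, the pooled within-class covariance, and the class priors from training data, then classifies a test point by maximizing $\delta_k(x)$; all names are illustrative.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class means, pooled within-class covariance (MLE), and priors."""
    classes = np.unique(y)
    n, p = X.shape
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sw = np.zeros((p, p))
    for k, mu in zip(classes, means):
        Xc = X[y == k] - mu
        Sw += Xc.T @ Xc
    Sw /= n
    priors = np.array([np.mean(y == k) for k in classes])
    return classes, means, Sw, priors

def lda_classify(x, classes, means, Sw, priors):
    """Assign x to the class maximizing
    delta_k(x) = x' Sw^{-1} mu_k - mu_k' Sw^{-1} mu_k / 2 + log pi_k."""
    Sw_inv = np.linalg.inv(Sw)
    scores = [x @ Sw_inv @ mu - 0.5 * mu @ Sw_inv @ mu + np.log(pk)
              for mu, pk in zip(means, priors)]
    return classes[int(np.argmax(scores))]
```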

Fisher's discriminant
A geometric perspective: project the data to achieve good classification.
[Figure not captured in the transcription.]

Fisher's discriminant and the associated criterion
- Look for the discriminant vector $\beta \in \mathbb{R}^p$ that maximizes $\beta^T \hat\Sigma_b \beta$ subject to $\beta^T \hat\Sigma_w \beta = 1$.
- $\hat\Sigma_b$ is an estimate of the between-class covariance matrix.
- $\hat\Sigma_w$ is an estimate of the within-class covariance matrix.
- This is a generalized eigenproblem; we can obtain multiple discriminant vectors.
- To classify, multiply the data by the discriminant vectors and perform nearest-centroid classification in this reduced space.
- If we use $K - 1$ discriminant vectors then we get the LDA classification rule. If we use fewer than $K - 1$, we get reduced-rank LDA.
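Since this is a symmetric-definite generalized eigenproblem, it can be solved with standard linear algebra routines. The sketch below (not from the talk) uses `scipy.linalg.eigh`, which applies only when $\hat\Sigma_w$ is positive definite, i.e. in the $p < n$ setting; the data are then projected onto the returned vectors and classified by nearest centroid.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_discriminants(Sb, Sw, n_vectors):
    """Return the top discriminant vectors (as columns) solving the
    generalized eigenproblem Sb v = lambda * Sw v.  eigh normalizes the
    eigenvectors so that v' Sw v = 1, matching the Fisher constraint."""
    evals, evecs = eigh(Sb, Sw)           # eigenvalues in ascending order
    return evecs[:, ::-1][:, :n_vectors]  # keep the leading eigenvectors
```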

LDA via optimal scoring
- Classification is such a bother. Isn't regression so much nicer?
- It wouldn't make sense to solve
  $\text{minimize}_{\beta} \{ \|y - X\beta\|^2 \}.$
- But can we formulate classification as a regression problem in some other way?

LDA via optimal scoring
- Let $Y$ be an $n \times K$ matrix of dummy variables, with $Y_{ik} = 1$ if $y_i = k$. Solve
  $\text{minimize}_{\beta, \theta} \{ \|Y\theta - X\beta\|^2 \}$ subject to $\theta^T Y^T Y \theta = 1.$
- We are choosing the optimal scoring of the class labels in order to recast the classification problem as a regression problem.
- The resulting $\beta$ is proportional to the discriminant vector in Fisher's discriminant problem.
- We can obtain the LDA classification rule, or reduced-rank LDA.
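One simple way to attack this criterion is to alternate between the two blocks of variables. The sketch below is illustrative only (not from the talk, and not necessarily the algorithm used in practice); it computes a single score vector $\theta$ and direction $\beta$, and assumes the columns of `X` are centered and the labels in `y` are coded $0, \ldots, K-1$.

```python
import numpy as np

def optimal_scoring_lda(X, y, K, n_iter=50):
    """Alternating minimization of ||Y theta - X beta||^2 subject to
    theta' Y'Y theta = 1, for a single score vector theta (illustrative)."""
    n, p = X.shape
    Y = np.zeros((n, K))
    Y[np.arange(n), y] = 1.0                 # dummy-variable indicator matrix
    YtY = Y.T @ Y

    theta = np.random.default_rng(0).standard_normal(K)
    theta /= np.sqrt(theta @ YtY @ theta)    # enforce the normalization
    for _ in range(n_iter):
        # beta-step: least-squares regression of the scored labels on X
        beta, *_ = np.linalg.lstsq(X, Y @ theta, rcond=None)
        # theta-step: project X beta onto the column space of Y, then rescale
        theta = np.linalg.solve(YtY, Y.T @ (X @ beta))
        theta /= np.sqrt(theta @ YtY @ theta)
    return beta, theta
```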

LDA when p > n
- When $p > n$, we cannot apply LDA directly, because the within-class covariance estimate is singular.
- There is also an interpretability issue:
  - All $p$ features are involved in the classification rule.
  - We want an interpretable classifier: for instance, a classification rule that is a sparse, smooth, or piecewise-constant linear combination of the features.

Penalized LDA
- We could extend LDA to the high-dimensional setting by applying (convex) penalties, in order to obtain an interpretable classifier.
- For concreteness, in this talk we will use $\ell_1$ penalties in order to obtain a sparse classifier.
- Which version of LDA should we penalize, and does it matter?

Penalized LDA via the normal model
- The classification rule for LDA is
  $x^T \hat\Sigma_w^{-1} \hat\mu_k - \frac{1}{2} \hat\mu_k^T \hat\Sigma_w^{-1} \hat\mu_k,$
  where $\hat\Sigma_w$ and $\hat\mu_k$ denote MLEs for $\Sigma_w$ and $\mu_k$.
- When $p > n$, we cannot invert $\hat\Sigma_w$.
- We can use a regularized estimate of $\Sigma_w$, such as the diagonal estimate
  $\hat\Sigma_w^D = \mathrm{diag}(\hat\sigma_1^2, \hat\sigma_2^2, \ldots, \hat\sigma_p^2).$
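A minimal sketch of such a diagonal estimate (not from the talk): the per-feature within-class variances, pooled over classes.

```python
import numpy as np

def diagonal_within_class_estimate(X, y):
    """Diagonal regularized estimate of Sigma_w: pooled per-feature
    within-class variances (illustrative helper)."""
    n, p = X.shape
    sigma2 = np.zeros(p)
    for k in np.unique(y):
        Xk = X[y == k]
        sigma2 += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return sigma2 / n            # MLE-style normalization by n
```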

Interpretable class centroids in the normal model
- For a sparse classifier, we need zeros in the estimate of $\Sigma_w^{-1} \mu_k$.
- An interpretable classifier:
  - Use $\hat\Sigma_w^D$, and estimate $\mu_k$ according to
    $\text{minimize}_{\mu_k} \left\{ \sum_{j=1}^p \sum_{i: y_i = k} \frac{(X_{ij} - \mu_{kj})^2}{\hat\sigma_j^2} + \lambda \|\mu_k\|_1 \right\}.$
  - Apply Bayes' theorem to obtain a classification rule.
- This is the nearest shrunken centroids proposal, which yields a sparse classifier because we are using a diagonal estimate of the within-class covariance matrix and a sparse estimate of the class mean vectors.
Citation: Tibshirani et al. 2003, Statistica Sinica
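Because the criterion above separates across classes and coordinates, each $\mu_{kj}$ has a closed-form soft-thresholded solution. The sketch below is not from the talk and is a simplification of the actual nearest shrunken centroids procedure of Tibshirani et al.; it simply follows the criterion as written, taking the diagonal variances `sigma2` (e.g. from the previous sketch) as input.

```python
import numpy as np

def soft_threshold(a, t):
    """Elementwise soft-thresholding S(a, t) = sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def sparse_class_means(X, y, sigma2, lam):
    """Per-class sparse mean estimates: minimizing
    sum_{i in class k} (X_ij - mu_kj)^2 / sigma_j^2 + lam * |mu_kj|
    coordinate-wise gives a soft-thresholded class mean (illustrative)."""
    classes = np.unique(y)
    centroids = np.zeros((len(classes), X.shape[1]))
    for idx, k in enumerate(classes):
        Xk = X[y == k]
        n_k = Xk.shape[0]
        centroids[idx] = soft_threshold(Xk.mean(axis=0),
                                        lam * sigma2 / (2 * n_k))
    return classes, centroids
```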

Penalized LDA via optimal scoring
- We can easily extend the optimal scoring criterion:
  $\text{minimize}_{\beta, \theta} \left\{ \frac{1}{n} \|Y\theta - X\beta\|^2 + \lambda \|\beta\|_1 \right\}$ subject to $\theta^T Y^T Y \theta = 1.$
- An efficient iterative algorithm will find a local optimum.
- We get sparse discriminant vectors, and hence classification using a subset of the features.
Citation: Clemmensen, Hastie, Witten, and Ersboll 2011, submitted
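One natural iterative scheme replaces the least-squares $\beta$-step of the earlier optimal-scoring sketch with a lasso regression. This is only a schematic illustration of the criterion above, not the exact algorithm of Clemmensen et al.; labels are again assumed coded $0, \ldots, K-1$ and `lam` is an illustrative tuning parameter.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_optimal_scoring(X, y, K, lam, n_iter=20):
    """Alternating scheme for the penalized optimal-scoring criterion:
    a lasso beta-step followed by the usual theta-step (illustrative)."""
    n, p = X.shape
    Y = np.zeros((n, K))
    Y[np.arange(n), y] = 1.0
    YtY = Y.T @ Y
    theta = np.random.default_rng(0).standard_normal(K)
    theta /= np.sqrt(theta @ YtY @ theta)
    beta = np.zeros(p)
    for _ in range(n_iter):
        # beta-step: lasso regression of the scored labels on X gives a
        # sparse discriminant vector
        beta = Lasso(alpha=lam, fit_intercept=False).fit(X, Y @ theta).coef_
        if not np.any(beta):
            break                            # penalty so large that beta = 0
        # theta-step: project X beta onto the column space of Y and rescale
        theta = np.linalg.solve(YtY, Y.T @ (X @ beta))
        theta /= np.sqrt(theta @ YtY @ theta)
    return beta, theta
```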

Penalized LDA via Fisher's discriminant problem
- A simple formulation:
  $\text{maximize}_{\beta} \{ \beta^T \hat\Sigma_b \beta - \lambda \|\beta\|_1 \}$ subject to $\beta^T \tilde\Sigma_w \beta \le 1,$
  where $\tilde\Sigma_w$ is some full-rank estimate of $\Sigma_w$.
- A non-convex problem, because $\beta^T \hat\Sigma_b \beta$ isn't concave in $\beta$.
- Can we find a local optimum?
Citation: Witten and Tibshirani 2011, JRSSB

Maximizing a function via minorization
[Figure not captured in the transcription.]

Minorization
- Key point: choose a minorizing function that is easy to maximize.
- Minorization allows us to efficiently find a local optimum for Fisher's discriminant problem with any convex penalty.
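To make the idea concrete, here is a sketch (not from the talk, and simpler than the algorithm in Witten and Tibshirani 2011) for the $\ell_1$-penalized Fisher problem with a diagonal estimate of $\tilde\Sigma_w$, passed in as the vector `sw_diag`. Since $\beta^T \hat\Sigma_b \beta$ is convex, its linearization at the current iterate minorizes it, and the resulting subproblem has a closed-form soft-thresholding solution.

```python
import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def penalized_fisher_minorization(Sb, sw_diag, lam, n_iter=100):
    """Maximize beta' Sb beta - lam * ||beta||_1 subject to
    beta' diag(sw_diag) beta <= 1 by iterative minorization (illustrative)."""
    p = Sb.shape[0]
    beta = np.random.default_rng(0).standard_normal(p)
    beta /= np.sqrt(beta @ (sw_diag * beta))
    for _ in range(n_iter):
        # Linear minorizer of the convex quadratic at the current iterate:
        # beta' Sb beta >= 2 (Sb beta_old)' beta - beta_old' Sb beta_old.
        c = 2.0 * Sb @ beta
        # Maximizing c'beta - lam*||beta||_1 over the ellipsoid has a
        # closed form: soft-threshold c, divide by the diagonal, rescale
        # onto the constraint boundary.
        b = soft_threshold(c, lam) / sw_diag
        if not np.any(b):
            return np.zeros(p)     # penalty large enough to zero out beta
        beta = b / np.sqrt(b @ (sw_diag * b))
    return beta
```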

Connections between flavors of penalized LDA

Connections between flavors of penalized LDA
1. Normal model $\ell_1$: use a diagonal estimate for $\Sigma_w$ and then apply an $\ell_1$ penalty to the class mean vectors.
2. Optimal scoring $\ell_1$: apply an $\ell_1$ penalty to the discriminant vectors.
3. Fisher's discriminant problem $\ell_1$: apply an $\ell_1$ penalty to the discriminant vectors.
So are (1) and (3) different? And are (2) and (3) the same?

Normal Model $\ell_1$ and Fisher's $\ell_1$
[Figure not captured in the transcription.]

Normal Model $\ell_1$ and Fisher's $\ell_1$
- Normal model $\ell_1$ penalizes the elements of this matrix (a figure on the slide, not captured in the transcription).
- Fisher's $\ell_1$ penalizes the left singular vectors.
- Clearly these are different...
- ...but if $K = 2$, then they are (essentially) the same.

Fisher's $\ell_1$ and Optimal Scoring $\ell_1$
Both problems involve "penalizing the discriminant vectors," so they must be the same, right?

Fisher's $\ell_1$ and Optimal Scoring $\ell_1$
Theorem: For any value of the tuning parameter for FD-$\ell_1$, there exists some tuning parameter for OS-$\ell_1$ such that the solution to one problem is a critical point of the other.
- In other words, there is a correspondence between the critical points, though not necessarily between the solutions.
- So the resulting "sparse discriminant vectors" may be different!

Connections
[Figure not captured in the transcription.]

Pros and Cons
Penalized LDA via the normal model:
- (+) In the case of a diagonal estimate for $\Sigma_w$ and $\ell_1$ penalties on the mean vectors, it is well motivated and simple.
- (-) No obvious extension to non-diagonal estimates of $\Sigma_w$.
- (-) Cannot obtain a "low-rank" classifier.
Penalized LDA via Fisher's discriminant problem:
- (+) Any convex penalties can be applied to the discriminant vectors.
- (+) Can use any full-rank estimate of $\Sigma_w$.
- (+) Can obtain a "low-rank" classifier.
Penalized LDA via optimal scoring:
- (+) An extension of regression.
- (+) Any convex penalties can be applied to the discriminant vectors.
- (+) Can obtain a "low-rank" classifier.
- (-) Cannot use an arbitrary estimate of $\Sigma_w$.
- (-) The usual motivation for optimal scoring is that it yields the same discriminant vectors as Fisher's problem. This is not true when penalized!

Conclusions
- A sensible way to regularize regression when $p > n$:
  $\text{minimize}_{\beta} \{ \|y - X\beta\|^2 + P(\beta) \}.$
- One could argue that this is the way to penalize regression.
- But as soon as we step away from regression, even to a closely related problem like LDA, the situation becomes much more complex: there is no longer a "single way" to approach the problem.
- And the situation only becomes more complex for more complex statistical methods!
- We need a principled framework to determine which penalized extension of an established statistical method is "best".

References
- Witten and Tibshirani (2011). Penalized classification using Fisher's linear discriminant. To appear in Journal of the Royal Statistical Society, Series B.
- Clemmensen, Hastie, Witten, and Ersboll (2011). Sparse discriminant analysis. Submitted.
