The Many Flavors of Penalized Linear Discriminant Analysis

Transcription

The Many Flavors of Penalized Linear Discriminant Analysis
Daniela M. Witten, Assistant Professor of Biostatistics, University of Washington
May 9, 2011, Fourth Erich L. Lehmann Symposium, Rice University

Overview
- There has been a great deal of interest in the past 15 years in penalized regression,
  $\text{minimize}_{\beta} \{ \|y - X\beta\|^2 + P(\beta) \},$
  especially in the setting where the number of features $p$ exceeds the number of observations $n$.
- $P$ is a penalty function. It could be chosen to promote:
  - sparsity: e.g. the lasso, $P(\beta) = \lambda \|\beta\|_1$
  - smoothness
  - piecewise constancy.
- How can we extend the concepts developed for regression when $p > n$ to other problems?
- A case study: penalized linear discriminant analysis.
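As a concrete illustration of the penalized regression criterion above, here is a minimal sketch (not from the talk) that fits a lasso to a simulated $p > n$ problem using scikit-learn; the data and the penalty level `alpha` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                       # more features than observations
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                  # only the first 5 features matter
y = X @ beta_true + rng.standard_normal(n)

# The lasso solves min_beta ||y - X beta||^2 / (2n) + alpha * ||beta||_1,
# i.e. the criterion above with an l1 penalty promoting sparsity.
fit = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", int(np.sum(fit.coef_ != 0)))
```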

The classification problem
- The set-up:
  - We are given $n$ training observations $x_1, \ldots, x_n \in \mathbb{R}^p$, each of which falls into one of $K$ classes.
  - Let $y \in \{1, \ldots, K\}^n$ contain the class memberships of the training observations.
  - Let $X$ be the matrix with rows $x_1^T, \ldots, x_n^T$.
  - Each column of $X$ (feature) is centered to have mean zero.
- The goal:
  - We wish to develop a classifier based on the training observations $x_1, \ldots, x_n \in \mathbb{R}^p$ that we can use to classify a test observation $x \in \mathbb{R}^p$.
  - A classical approach: linear discriminant analysis.

Linear discriminant analysis

LDA via the normal model
- Fit a simple normal model to the data: $x_i \mid y_i = k \sim N(\mu_k, \Sigma_w)$.
- Apply Bayes' theorem to obtain a classifier: assign $x$ to the class for which $\delta_k(x)$ is largest, where
  $\delta_k(x) = x^T \Sigma_w^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma_w^{-1} \mu_k + \log \pi_k.$
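The rule above translates directly into code. Below is a minimal sketch (not from the talk) that estimates the class means, the pooled within-class covariance, and the class priors from training data, then classifies a test point by maximizing $\delta_k(x)$; all names are illustrative.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class means, pooled within-class covariance (MLE), and priors."""
    classes = np.unique(y)
    n, p = X.shape
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sw = np.zeros((p, p))
    for k, mu in zip(classes, means):
        Xc = X[y == k] - mu
        Sw += Xc.T @ Xc
    Sw /= n
    priors = np.array([np.mean(y == k) for k in classes])
    return classes, means, Sw, priors

def lda_classify(x, classes, means, Sw, priors):
    """Assign x to the class maximizing
    delta_k(x) = x' Sw^{-1} mu_k - mu_k' Sw^{-1} mu_k / 2 + log pi_k."""
    Sw_inv = np.linalg.inv(Sw)
    scores = [x @ Sw_inv @ mu - 0.5 * mu @ Sw_inv @ mu + np.log(pk)
              for mu, pk in zip(means, priors)]
    return classes[int(np.argmax(scores))]
```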

Fisher's discriminant
A geometric perspective: project the data to achieve good classification.
[Figure not captured in the transcription.]

Fisher's discriminant and the associated criterion
- Look for the discriminant vector $\beta \in \mathbb{R}^p$ that maximizes $\beta^T \hat\Sigma_b \beta$ subject to $\beta^T \hat\Sigma_w \beta = 1$.
- $\hat\Sigma_b$ is an estimate of the between-class covariance matrix.
- $\hat\Sigma_w$ is an estimate of the within-class covariance matrix.
- This is a generalized eigenproblem; we can obtain multiple discriminant vectors.
- To classify, multiply the data by the discriminant vectors and perform nearest-centroid classification in this reduced space.
- If we use $K - 1$ discriminant vectors then we get the LDA classification rule. If we use fewer than $K - 1$, we get reduced-rank LDA.
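Since this is a symmetric-definite generalized eigenproblem, it can be solved with standard linear algebra routines. The sketch below (not from the talk) uses `scipy.linalg.eigh`, which applies only when $\hat\Sigma_w$ is positive definite, i.e. in the $p < n$ setting; the data are then projected onto the returned vectors and classified by nearest centroid.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_discriminants(Sb, Sw, n_vectors):
    """Return the top discriminant vectors (as columns) solving the
    generalized eigenproblem Sb v = lambda * Sw v.  eigh normalizes the
    eigenvectors so that v' Sw v = 1, matching the Fisher constraint."""
    evals, evecs = eigh(Sb, Sw)           # eigenvalues in ascending order
    return evecs[:, ::-1][:, :n_vectors]  # keep the leading eigenvectors
```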

LDA via optimal scoring
- Classification is such a bother. Isn't regression so much nicer?
- It wouldn't make sense to solve
  $\text{minimize}_{\beta} \{ \|y - X\beta\|^2 \}.$
- But can we formulate classification as a regression problem in some other way?

LDA via optimal scoring
- Let $Y$ be an $n \times K$ matrix of dummy variables, with $Y_{ik} = 1$ if $y_i = k$. Solve
  $\text{minimize}_{\beta, \theta} \{ \|Y\theta - X\beta\|^2 \}$ subject to $\theta^T Y^T Y \theta = 1.$
- We are choosing the optimal scoring of the class labels in order to recast the classification problem as a regression problem.
- The resulting $\beta$ is proportional to the discriminant vector in Fisher's discriminant problem.
- We can obtain the LDA classification rule, or reduced-rank LDA.
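One simple way to attack this criterion is to alternate between the two blocks of variables. The sketch below is illustrative only (not from the talk, and not necessarily the algorithm used in practice); it computes a single score vector $\theta$ and direction $\beta$, and assumes the columns of `X` are centered and the labels in `y` are coded $0, \ldots, K-1$.

```python
import numpy as np

def optimal_scoring_lda(X, y, K, n_iter=50):
    """Alternating minimization of ||Y theta - X beta||^2 subject to
    theta' Y'Y theta = 1, for a single score vector theta (illustrative)."""
    n, p = X.shape
    Y = np.zeros((n, K))
    Y[np.arange(n), y] = 1.0                 # dummy-variable indicator matrix
    YtY = Y.T @ Y

    theta = np.random.default_rng(0).standard_normal(K)
    theta /= np.sqrt(theta @ YtY @ theta)    # enforce the normalization
    for _ in range(n_iter):
        # beta-step: least-squares regression of the scored labels on X
        beta, *_ = np.linalg.lstsq(X, Y @ theta, rcond=None)
        # theta-step: project X beta onto the column space of Y, then rescale
        theta = np.linalg.solve(YtY, Y.T @ (X @ beta))
        theta /= np.sqrt(theta @ YtY @ theta)
    return beta, theta
```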

LDA when p > n
- When $p > n$, we cannot apply LDA directly, because the within-class covariance estimate is singular.
- There is also an interpretability issue:
  - All $p$ features are involved in the classification rule.
  - We want an interpretable classifier: for instance, a classification rule that is a sparse, smooth, or piecewise-constant linear combination of the features.

Penalized LDA
- We could extend LDA to the high-dimensional setting by applying (convex) penalties, in order to obtain an interpretable classifier.
- For concreteness, in this talk we will use $\ell_1$ penalties in order to obtain a sparse classifier.
- Which version of LDA should we penalize, and does it matter?

Penalized LDA via the normal model
- The classification rule for LDA is
  $x^T \hat\Sigma_w^{-1} \hat\mu_k - \frac{1}{2} \hat\mu_k^T \hat\Sigma_w^{-1} \hat\mu_k,$
  where $\hat\Sigma_w$ and $\hat\mu_k$ denote MLEs for $\Sigma_w$ and $\mu_k$.
- When $p > n$, we cannot invert $\hat\Sigma_w$.
- We can use a regularized estimate of $\Sigma_w$, such as the diagonal estimate
  $\hat\Sigma_w^D = \mathrm{diag}(\hat\sigma_1^2, \hat\sigma_2^2, \ldots, \hat\sigma_p^2).$
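A minimal sketch of such a diagonal estimate (not from the talk): the per-feature within-class variances, pooled over classes.

```python
import numpy as np

def diagonal_within_class_estimate(X, y):
    """Diagonal regularized estimate of Sigma_w: pooled per-feature
    within-class variances (illustrative helper)."""
    n, p = X.shape
    sigma2 = np.zeros(p)
    for k in np.unique(y):
        Xk = X[y == k]
        sigma2 += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return sigma2 / n            # MLE-style normalization by n
```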

Interpretable class centroids in the normal model
- For a sparse classifier, we need zeros in the estimate of $\Sigma_w^{-1} \mu_k$.
- An interpretable classifier:
  - Use $\hat\Sigma_w^D$, and estimate $\mu_k$ according to
    $\text{minimize}_{\mu_k} \left\{ \sum_{j=1}^p \sum_{i: y_i = k} \frac{(X_{ij} - \mu_{kj})^2}{\hat\sigma_j^2} + \lambda \|\mu_k\|_1 \right\}.$
  - Apply Bayes' theorem to obtain a classification rule.
- This is the nearest shrunken centroids proposal, which yields a sparse classifier because we are using a diagonal estimate of the within-class covariance matrix and a sparse estimate of the class mean vectors.
Citation: Tibshirani et al. 2003, Statistica Sinica
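Because the criterion above separates across classes and coordinates, each $\mu_{kj}$ has a closed-form soft-thresholded solution. The sketch below is not from the talk and is a simplification of the actual nearest shrunken centroids procedure of Tibshirani et al.; it simply follows the criterion as written, taking the diagonal variances `sigma2` (e.g. from the previous sketch) as input.

```python
import numpy as np

def soft_threshold(a, t):
    """Elementwise soft-thresholding S(a, t) = sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def sparse_class_means(X, y, sigma2, lam):
    """Per-class sparse mean estimates: minimizing
    sum_{i in class k} (X_ij - mu_kj)^2 / sigma_j^2 + lam * |mu_kj|
    coordinate-wise gives a soft-thresholded class mean (illustrative)."""
    classes = np.unique(y)
    centroids = np.zeros((len(classes), X.shape[1]))
    for idx, k in enumerate(classes):
        Xk = X[y == k]
        n_k = Xk.shape[0]
        centroids[idx] = soft_threshold(Xk.mean(axis=0),
                                        lam * sigma2 / (2 * n_k))
    return classes, centroids
```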

Penalized LDA via optimal scoring
- We can easily extend the optimal scoring criterion:
  $\text{minimize}_{\beta, \theta} \left\{ \frac{1}{n} \|Y\theta - X\beta\|^2 + \lambda \|\beta\|_1 \right\}$ subject to $\theta^T Y^T Y \theta = 1.$
- An efficient iterative algorithm will find a local optimum.
- We get sparse discriminant vectors, and hence classification using a subset of the features.
Citation: Clemmensen, Hastie, Witten, and Ersboll 2011, submitted
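One natural iterative scheme replaces the least-squares $\beta$-step of the earlier optimal-scoring sketch with a lasso regression. This is only a schematic illustration of the criterion above, not the exact algorithm of Clemmensen et al.; labels are again assumed coded $0, \ldots, K-1$ and `lam` is an illustrative tuning parameter.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_optimal_scoring(X, y, K, lam, n_iter=20):
    """Alternating scheme for the penalized optimal-scoring criterion:
    a lasso beta-step followed by the usual theta-step (illustrative)."""
    n, p = X.shape
    Y = np.zeros((n, K))
    Y[np.arange(n), y] = 1.0
    YtY = Y.T @ Y
    theta = np.random.default_rng(0).standard_normal(K)
    theta /= np.sqrt(theta @ YtY @ theta)
    beta = np.zeros(p)
    for _ in range(n_iter):
        # beta-step: lasso regression of the scored labels on X gives a
        # sparse discriminant vector
        beta = Lasso(alpha=lam, fit_intercept=False).fit(X, Y @ theta).coef_
        if not np.any(beta):
            break                            # penalty so large that beta = 0
        # theta-step: project X beta onto the column space of Y and rescale
        theta = np.linalg.solve(YtY, Y.T @ (X @ beta))
        theta /= np.sqrt(theta @ YtY @ theta)
    return beta, theta
```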

Penalized LDA via Fisher's discriminant problem
- A simple formulation:
  $\text{maximize}_{\beta} \{ \beta^T \hat\Sigma_b \beta - \lambda \|\beta\|_1 \}$ subject to $\beta^T \tilde\Sigma_w \beta \le 1,$
  where $\tilde\Sigma_w$ is some full-rank estimate of $\Sigma_w$.
- A non-convex problem, because $\beta^T \hat\Sigma_b \beta$ isn't concave in $\beta$.
- Can we find a local optimum?
Citation: Witten and Tibshirani 2011, JRSSB

Maximizing a function via minorization
[Figure not captured in the transcription.]

Minorization
- Key point: choose a minorizing function that is easy to maximize.
- Minorization allows us to efficiently find a local optimum for Fisher's discriminant problem with any convex penalty.
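To make the idea concrete, here is a sketch (not from the talk, and simpler than the algorithm in Witten and Tibshirani 2011) for the $\ell_1$-penalized Fisher problem with a diagonal estimate of $\tilde\Sigma_w$, passed in as the vector `sw_diag`. Since $\beta^T \hat\Sigma_b \beta$ is convex, its linearization at the current iterate minorizes it, and the resulting subproblem has a closed-form soft-thresholding solution.

```python
import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def penalized_fisher_minorization(Sb, sw_diag, lam, n_iter=100):
    """Maximize beta' Sb beta - lam * ||beta||_1 subject to
    beta' diag(sw_diag) beta <= 1 by iterative minorization (illustrative)."""
    p = Sb.shape[0]
    beta = np.random.default_rng(0).standard_normal(p)
    beta /= np.sqrt(beta @ (sw_diag * beta))
    for _ in range(n_iter):
        # Linear minorizer of the convex quadratic at the current iterate:
        # beta' Sb beta >= 2 (Sb beta_old)' beta - beta_old' Sb beta_old.
        c = 2.0 * Sb @ beta
        # Maximizing c'beta - lam*||beta||_1 over the ellipsoid has a
        # closed form: soft-threshold c, divide by the diagonal, rescale
        # onto the constraint boundary.
        b = soft_threshold(c, lam) / sw_diag
        if not np.any(b):
            return np.zeros(p)     # penalty large enough to zero out beta
        beta = b / np.sqrt(b @ (sw_diag * b))
    return beta
```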

Connections between flavors of penalized LDA

Connections between flavors of penalized LDA
1. Normal model $\ell_1$: use a diagonal estimate for $\Sigma_w$ and then apply an $\ell_1$ penalty to the class mean vectors.
2. Optimal scoring $\ell_1$: apply an $\ell_1$ penalty to the discriminant vectors.
3. Fisher's discriminant problem $\ell_1$: apply an $\ell_1$ penalty to the discriminant vectors.
So are (1) and (3) different? And are (2) and (3) the same?

Normal Model $\ell_1$ and Fisher's $\ell_1$
[Figure not captured in the transcription.]

Normal Model $\ell_1$ and Fisher's $\ell_1$
- Normal model $\ell_1$ penalizes the elements of this matrix (a figure on the slide, not captured in the transcription).
- Fisher's $\ell_1$ penalizes the left singular vectors.
- Clearly these are different...
- ...but if $K = 2$, then they are (essentially) the same.

Fisher's $\ell_1$ and Optimal Scoring $\ell_1$
Both problems involve "penalizing the discriminant vectors," so they must be the same, right?

Fisher's $\ell_1$ and Optimal Scoring $\ell_1$
Theorem: For any value of the tuning parameter for FD-$\ell_1$, there exists some tuning parameter for OS-$\ell_1$ such that the solution to one problem is a critical point of the other.
- In other words, there is a correspondence between the critical points, though not necessarily between the solutions.
- So the resulting "sparse discriminant vectors" may be different!

Connections
[Figure not captured in the transcription.]

Pros and Cons
Penalized LDA via the normal model:
- (+) In the case of a diagonal estimate for $\Sigma_w$ and $\ell_1$ penalties on the mean vectors, it is well motivated and simple.
- (-) No obvious extension to non-diagonal estimates of $\Sigma_w$.
- (-) Cannot obtain a "low-rank" classifier.
Penalized LDA via Fisher's discriminant problem:
- (+) Any convex penalties can be applied to the discriminant vectors.
- (+) Can use any full-rank estimate of $\Sigma_w$.
- (+) Can obtain a "low-rank" classifier.
Penalized LDA via optimal scoring:
- (+) An extension of regression.
- (+) Any convex penalties can be applied to the discriminant vectors.
- (+) Can obtain a "low-rank" classifier.
- (-) Cannot use an arbitrary estimate of $\Sigma_w$.
- (-) The usual motivation for optimal scoring is that it yields the same discriminant vectors as Fisher's problem. This is not true when penalized!

Conclusions
- A sensible way to regularize regression when $p > n$:
  $\text{minimize}_{\beta} \{ \|y - X\beta\|^2 + P(\beta) \}.$
- One could argue that this is the way to penalize regression.
- But as soon as we step away from regression, even to a closely related problem like LDA, the situation becomes much more complex: there is no longer a "single way" to approach the problem.
- And the situation only becomes more complex for more complex statistical methods!
- We need a principled framework to determine which penalized extension of an established statistical method is "best".

References
- Witten and Tibshirani (2011). Penalized classification using Fisher's linear discriminant. To appear in Journal of the Royal Statistical Society, Series B.
- Clemmensen, Hastie, Witten, and Ersboll (2011). Sparse discriminant analysis. Submitted.
