Linear Discriminant Analysis (LDA) - GitHub Pages

Transcription

Linear Discriminant Analysis (LDA)
Yang Xiaozhou
March 18, 2020
Industrial Systems Engineering and Management, NUS

Table of contents
1. LDA
2. Reduced-rank LDA
3. Fisher’s LDA
4. Flexible Discriminant Analysis

LDA

LDA and its applications

LDA is used as a tool for classification.
- Bankruptcy prediction: Edward Altman’s 1968 model
- Face recognition: learnt features are called Fisher faces
- Biomedical studies: discriminate different stages of a disease
- and many more

It has shown some really good results: among the top 3 classifiers for 11 of the 22 datasets studied in the STATLOG project (Michie et al. 1994).

Classification by discriminant analysis

Consider a generic classification problem:
- K groups: G = 1, ..., K, each with a density f_k(x) on R^p.
- A discriminant rule divides the space into K disjoint regions R_1, ..., R_K and allocates x to Π_j if x ∈ R_j.
- Maximum likelihood rule: allocate x to Π_j if j = arg max_i f_i(x).
- Bayesian rule with class priors π_1, ..., π_K: allocate x to Π_j if j = arg max_i π_i f_i(x).
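A minimal sketch of the Bayesian rule above, not from the slides: it evaluates π_i f_i(x) for each class under Gaussian class densities and returns the maximizing class. The means, covariances and priors are made up for illustration.

```python
# Minimal sketch: Bayesian discriminant rule with Gaussian class densities.
# The class parameters below are illustrative, not estimated from data.
import numpy as np
from scipy.stats import multivariate_normal

means = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
covs = [np.eye(2), 1.5 * np.eye(2), np.eye(2)]
priors = [0.5, 0.3, 0.2]

def allocate(x):
    """Allocate x to the class j that maximizes pi_j * f_j(x)."""
    scores = [p * multivariate_normal(mean=m, cov=c).pdf(x)
              for p, m, c in zip(priors, means, covs)]
    return int(np.argmax(scores))

print(allocate(np.array([1.5, 1.0])))  # index of the chosen class
```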

Gaussian as class density

If we assume the data come from a Gaussian distribution:

$$ f_k(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_k|^{1/2}} \exp\left\{-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right\} $$

- Parameters are estimated using the training data: π̂_k, µ̂_k, Σ̂_k.
- Looking at the log-likelihood: allocate x to Π_j if j = arg max_i δ_i(x), where δ_i(x) = log f_i(x) + log π_i is called the discriminant function.

Assuming equal covariance among the K classes gives LDA:

$$ \delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k $$

Without that assumption on the class covariance: QDA.
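A short sketch, assuming the equal-covariance (LDA) setting, of estimating π̂_k, µ̂_k and a pooled Σ̂ from labelled data and evaluating the discriminant functions δ_k(x); the two-class toy data are only illustrative.

```python
# Sketch: plug-in LDA discriminant functions delta_k(x) from sample estimates.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([2, 2], 1.0, (50, 2))])
y = np.repeat([0, 1], 50)

classes = np.unique(y)
priors = np.array([np.mean(y == k) for k in classes])
means = np.array([X[y == k].mean(axis=0) for k in classes])
# Pooled within-class covariance: the equal-covariance assumption of LDA.
pooled = sum((X[y == k] - means[i]).T @ (X[y == k] - means[i])
             for i, k in enumerate(classes)) / (len(X) - len(classes))
pooled_inv = np.linalg.inv(pooled)

def delta(x):
    """LDA discriminant scores delta_k(x) for every class k."""
    return np.array([x @ pooled_inv @ means[i]
                     - 0.5 * means[i] @ pooled_inv @ means[i]
                     + np.log(priors[i]) for i in range(len(classes))])

x_new = np.array([1.0, 1.0])
print(classes[np.argmax(delta(x_new))])  # allocated class for x_new
```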

Decision boundary: LDA vs QDA

Between any pair of classes k and ℓ, the decision boundary is:

$$ \{x : \delta_k(x) = \delta_\ell(x)\} $$

- LDA: linear boundary; QDA: quadratic boundary.
- The number of parameters to estimate rises quickly in QDA:
  - LDA: (K − 1)(p + 1)
  - QDA: (K − 1){p(p + 3)/2 + 1}
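As a practical aside (not from the slides), scikit-learn implements both classifiers, so the linear and quadratic rules can be compared directly; the synthetic data here are purely illustrative.

```python
# Sketch: fit LDA (linear boundaries) and QDA (quadratic boundaries)
# on the same synthetic data with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)
print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
```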

Reduced-rank LDA

Reduced-rank LDA

Computation for LDA:
- Sphere the data: x* = D^{-1/2} U^T x, where Σ̂ = U D U^T.
- Classify x* to the closest centroid in the transformed space:

$$ \delta_k(x^*) = x^{*T}\hat\mu^*_k - \tfrac{1}{2}\hat\mu^{*T}_k \hat\mu^*_k + \log \hat\pi_k $$

Inherent dimension reduction in LDA:
- The K centroids lie in a subspace of dimension at most K − 1:

$$ H_{K-1} = \mu_1 \oplus \mathrm{span}\{\mu_i - \mu_1,\; 2 \le i \le K\} $$

- Classification is done by distance comparison in H_{K−1}.
- p → K − 1 dimension reduction, assuming p > K.
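A sketch of the sphering step: with Σ̂ = U D U^T, the transform x* = D^{-1/2} U^T x makes the estimated covariance the identity. Here Σ̂ is simply the sample covariance of some toy data, standing in for the pooled within-class estimate.

```python
# Sketch: sphere (whiten) the data using the eigendecomposition of Sigma_hat.
import numpy as np

def sphere(X, sigma_hat):
    """Return X transformed so that the estimated covariance becomes identity."""
    eigvals, U = np.linalg.eigh(sigma_hat)      # sigma_hat = U diag(eigvals) U^T
    return X @ (U @ np.diag(eigvals ** -0.5))   # row-wise x* = D^{-1/2} U^T x

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=500)
X_star = sphere(X, np.cov(X, rowvar=False))
print(np.cov(X_star, rowvar=False).round(2))    # approximately the identity
```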

Reduced-rank LDA

We can look for an even smaller subspace H_L ⊆ H_{K−1}:
- Rule: class centroids of the sphered data have maximum separation in this subspace in terms of variance.

PCA on the class centroids finds the coordinates of H_L (see the sketch below):
1. Find the class means and pooled within-class covariance: M, W.
2. Sphere the centroids: M* = M W^{-1/2}.
3. Obtain the eigenvectors v_ℓ in V* of cov(M*) = V* D_B V*^T.
4. Obtain the new (discriminant) variables Z_ℓ = (W^{-1/2} v_ℓ)^T X, ℓ = 1, ..., L.

Dimension reduction: X_{N×p} → Z_{N×L}.
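The four steps can be written out directly in NumPy. This is a rough sketch under the slide’s notation (M, W, V*, Z); the function name and toy data are illustrative only.

```python
# Sketch of reduced-rank LDA: PCA on the sphered class centroids.
import numpy as np

def discriminant_coordinates(X, y, L):
    classes = np.unique(y)
    # 1. Class means M and pooled within-class covariance W.
    M = np.array([X[y == k].mean(axis=0) for k in classes])
    W = sum((X[y == k] - M[i]).T @ (X[y == k] - M[i])
            for i, k in enumerate(classes)) / (len(X) - len(classes))
    # 2. Sphere the centroids: M* = M W^{-1/2}.
    eigvals, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(eigvals ** -0.5) @ U.T
    M_star = M @ W_inv_sqrt
    # 3. Leading eigenvectors of cov(M*).
    _, V = np.linalg.eigh(np.cov(M_star, rowvar=False))
    V = V[:, ::-1][:, :L]
    # 4. Discriminant variables Z = (W^{-1/2} v)^T X, stacked over observations.
    return X @ (W_inv_sqrt @ V)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, (40, 4)) for m in ([0]*4, [2]*4, [0, 2, 0, 2])])
y = np.repeat([0, 1, 2], 40)
print(discriminant_coordinates(X, y, L=2).shape)  # (120, 2)
```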

Reduced-rank LDA vs PCA

Wine dataset: 13 variables to distinguish three types of wines.

[Figure: the wine data projected onto Discriminant Coordinates 1 and 2 (LDA) and onto PC 1 and 2 (PCA).]
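The comparison can be reproduced approximately with scikit-learn’s bundled copy of the same UCI wine data; this sketch just computes the two 2-dimensional projections (plotting omitted).

```python
# Sketch: project the wine data (13 variables, 3 classes) onto the first two
# LDA discriminant coordinates and the first two principal components.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
Z_pca = PCA(n_components=2).fit_transform(X)
print(Z_lda.shape, Z_pca.shape)  # (178, 2) (178, 2)
```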

Fisher’s LDA

Fisher’s LDA

The previous rule was proposed by Fisher:
- Find a linear combination Z = a^T X that has maximum between-class variance relative to its within-class variance:

$$ a^* = \arg\max_a \frac{a^T B a}{a^T W a} $$

- The optimization is solved by a generalized eigenvalue problem:

$$ W^{-1} B a = \lambda a $$

- The eigenvectors a_ℓ of W^{-1}B are the same as the (W^{-1/2} v_ℓ) above. Fisher arrives at this without the Gaussian assumption.
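The generalized eigenproblem can be solved without forming W^{-1} explicitly, e.g. with scipy.linalg.eigh(B, W). A sketch, where B and W are the usual between- and within-class scatter estimates and the toy data are illustrative:

```python
# Sketch: Fisher's discriminant directions as generalized eigenvectors of (B, W).
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, L):
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Between-class and within-class scatter matrices.
    B = sum(np.sum(y == k) * np.outer(means[i] - grand_mean, means[i] - grand_mean)
            for i, k in enumerate(classes))
    W = sum((X[y == k] - means[i]).T @ (X[y == k] - means[i])
            for i, k in enumerate(classes))
    # Solve B a = lambda W a; eigh returns eigenvalues in ascending order.
    _, A = eigh(B, W)
    return A[:, ::-1][:, :L]  # leading L directions

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 1.0, (30, 3)) for m in ([0, 0, 0], [3, 0, 0], [0, 3, 0])])
y = np.repeat([0, 1, 2], 30)
print(fisher_directions(X, y, L=2).shape)  # (3, 2)
```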

Fisher’s LDA

Digit dataset: 64 variables to distinguish 10 handwritten digits. The top 4 of Fisher’s discriminant variables are shown. For example, coordinate 1 contrasts 4’s and 2’s/3’s.

[Figure: the digit data plotted on Discriminant Coordinates 1 vs 2 and 3 vs 4.]

Summary of LDA

Virtues of LDA:
1. Simple prototype classifier: simple to interpret.
2. Decision boundary is linear: simple to describe and implement.
3. Dimension reduction: provides an informative low-dimensional view of the data.

Shortcomings of LDA:
1. Linear decision boundaries may not adequately separate the classes. Support for more general boundaries is desired.
2. In a high-dimensional setting, LDA uses too many parameters. A regularized version of LDA is desired.

Flexible Discriminant Analysis

Beyond linear boundaries: FDA

Flexible discriminant analysis (FDA) can tackle the first shortcoming.

[Figure: LDA decision boundaries vs QDA decision boundaries for a three-class problem plotted in (X1, X2).]

Idea: recast LDA as a regression problem, then apply the same techniques used to generalize linear regression.

LDA as a regression problem

We can recast LDA as a regression problem via optimal scoring.

Set up:
- The response G falls into one of K classes, G ∈ {1, ..., K}.
- X is the p-dimensional feature vector.

Suppose a scoring function θ : G → R^1 such that the scores are optimally predicted by regressing on X, e.g. a linear map η(X) = X^T β.

In general, select L ≤ K − 1 such scoring functions and find the optimal {score, linear map} pairs that minimize the average squared residual:

$$ \mathrm{ASR} = \frac{1}{N}\sum_{\ell=1}^{L}\left[\sum_{i=1}^{N}\left(\theta_\ell(g_i) - x_i^T \beta_\ell\right)^2\right] $$

LDA via optimal scoring

Procedure for LDA via optimal scoring (a code sketch follows below):
1. Initialize. Build the response indicator matrix Y_{N×K}, where Y_{ij} = 1 if the ith sample comes from the jth class, and 0 otherwise.
2. Multivariate regression. Regress Y on X using ASR to get P_X, where Ŷ = P_X Y, and the regression coefficients B.
3. Optimal scores. Obtain the L largest eigenvectors Θ of Y^T P_X Y.
4. Update. Update the coefficients: B ← B Θ.

- The optimal linear map is η(X) = B^T X.
- The columns of B, β_1, ..., β_L, are the same as the a_ℓ’s in LDA up to a constant.

This equivalence with a regression problem provides a starting point for generalizing LDA to a more flexible and nonparametric version.
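A rough sketch of the four steps, using ordinary least squares (with an intercept) for the multivariate regression; the normalization of the scores is glossed over, and the function name and toy data are illustrative only.

```python
# Sketch of LDA via optimal scoring: indicator response, multivariate
# regression, then an eigen-decomposition of Y^T P_X Y.
import numpy as np

def optimal_scoring(X, y, L):
    N = len(y)
    classes = np.unique(y)
    # 1. Initialize: response indicator matrix Y (N x K).
    Y = (y[:, None] == classes[None, :]).astype(float)
    # 2. Multivariate regression of Y on X (with intercept): Y_hat = P_X Y.
    X1 = np.column_stack([np.ones(N), X])
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    Y_hat = X1 @ B
    # 3. Optimal scores: leading L eigenvectors of Y^T P_X Y = Y^T Y_hat.
    _, Theta = np.linalg.eigh(Y.T @ Y_hat)
    Theta = Theta[:, ::-1][:, :L]
    # 4. Update the coefficients: B <- B Theta.
    return B @ Theta

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 1.0, (40, 3)) for m in ([0, 0, 0], [2, 2, 0], [0, 2, 2])])
y = np.repeat([0, 1, 2], 40)
print(optimal_scoring(X, y, L=2).shape)  # (4, 2): intercept row + one row per feature
```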

From LDA to FDA

Extend LDA by generalizing the linear map η(X) = B^T X to η(X) = B^T h(X), where h(X) is a nonlinear or nonparametric expansion of the inputs fitted by, for example:
- Generalized additive fits
- Spline functions
- MARS models
- Projection pursuit
- Neural networks

The idea behind FDA: LDA in an enlarged space.
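One simple way to realize “LDA in an enlarged space” is to expand the inputs with a fixed nonlinear basis h(X) and run ordinary LDA on the expanded features; the polynomial expansion below is only one illustrative choice of h, not the smoothing-spline or MARS fits used in the FDA papers.

```python
# Sketch: an FDA-flavoured classifier built as LDA on basis-expanded inputs h(X).
from sklearn.datasets import make_moons
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
linear = LinearDiscriminantAnalysis().fit(X, y)
expanded = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                         LinearDiscriminantAnalysis()).fit(X, y)
print("LDA on X:   ", linear.score(X, y))
print("LDA on h(X):", expanded.score(X, y))  # nonlinear boundary in the original space
```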

FDA via optimal scoring

The procedure for FDA is the same as LDA via optimal scoring, with one change: replace P_X with S_{h(X)}, the nonparametric regression operator.

Initialize → Multivariate regression → Optimal scores → Update.

- Optimal fit: η̂(X).
- Fitted class centroids: η̄_k = Σ_{g_i = k} η̂(x_i) / N_k.

A new observation x is classified to the class k that minimizes

$$ \delta(x, k) = \left\lVert D\left(\hat\eta(x) - \bar\eta_k\right)\right\rVert^2 $$

where D is the constant factor linking the optimal fits and the LDA coordinates.
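The classification step is essentially a nearest-centroid rule in the space of fitted values η̂(x). In the sketch below the weighting matrix D is taken as the identity purely for illustration; the fitted values are made up.

```python
# Sketch: classify by distance to the fitted class centroids eta_bar_k,
# taking the weighting matrix D as the identity for simplicity.
import numpy as np

def nearest_centroid_predict(eta_train, y_train, eta_new):
    classes = np.unique(y_train)
    # Fitted class centroids: mean of the fitted values within each class.
    centroids = np.array([eta_train[y_train == k].mean(axis=0) for k in classes])
    d = ((eta_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

rng = np.random.default_rng(5)
eta_train = np.vstack([rng.normal(m, 0.3, (30, 2)) for m in ([0, 0], [1, 0], [0, 1])])
y_train = np.repeat([0, 1, 2], 30)
print(nearest_centroid_predict(eta_train, y_train, np.array([[0.9, 0.1]])))
```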

LDA vs FDA

Data: three classes with mixture Gaussian densities. FDA uses an additive model with smoothing splines, each fit of the form:

$$ \alpha + \sum_{j=1}^{p} f_j(X_j) $$

[Figure: LDA decision boundaries vs FDA decision boundaries for the three-class mixture data plotted in (X1, X2).]

References
1. Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1, No. 10). New York: Springer Series in Statistics.
2. Hastie, T., Tibshirani, R., & Buja, A. (1994). Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89(428), 1255-1270.
3. Hastie, T., Tibshirani, R., & Buja, A. (1995). Flexible discriminant and mixture models.
4. Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate Analysis. Probability and Mathematical Statistics. Academic Press.

Questions?

Contact: xiaozhou.yang@u.nus.edu
