Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Transcription

Deterministic Component Analysis

Goal (Lecture): To present standard and modern Component Analysis (CA) techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Graph/Neighbourhood-based Component Analysis.

Goal (Tutorials): To provide the students with the necessary mathematical tools for deeply understanding the CA techniques.

Materials

- Pattern Recognition & Machine Learning by C. Bishop, Chapter 12.
- Turk, Matthew, and Alex Pentland. "Eigenfaces for recognition." Journal of Cognitive Neuroscience 3.1 (1991): 71-86.
- Belhumeur, Peter N., João P. Hespanha, and David J. Kriegman. "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection." IEEE Transactions on Pattern Analysis and Machine Intelligence 19.7 (1997): 711-720.
- He, Xiaofei, et al. "Face recognition using Laplacianfaces." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.3 (2005): 328-340.
- Belkin, Mikhail, and Partha Niyogi. "Laplacian eigenmaps for dimensionality reduction and data representation." Neural Computation 15.6 (2003): 1373-1396.

Deterministic Component Analysis

Problem: Given a population of data $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^F$ (i.e. observations), find a latent space $\mathbf{y}_1, \dots, \mathbf{y}_N \in \mathbb{R}^d$ (usually $d \ll F$) which is relevant to a task.

[Figure: each observation $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$ is mapped to a latent representation $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N$.]

Latent Variable Models

- Linear: the latent space is obtained from the observations through a linear mapping (e.g. $\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i$), with the projection matrix as parameters, $d \ll F$.
- Non-linear: the latent space is obtained through a non-linear mapping of the observations.

What to have in mind?

- What are the properties of my latent space?
- How do I find it (linear, non-linear)?
- Which is the cost function?
- How do I solve the problem?

A first example

- Let's assume that we want to find a descriptive latent space, i.e. one that best describes my population as a whole.
- How do I define it mathematically?
- Idea! This is a statistical machine learning course, isn't it? Hence, I will try to preserve global statistical properties.

A first example

- What are the data statistics that can be used to describe the variability of my observations?
- One such statistic is the variance of the population (how much the data deviate around a mean).
- Attempt: I want to find a low-dimensional latent space where the "majority" of the variance is preserved (or, in other words, maximized).

PCA (one dimension)

Variance of the latent space $\{y_1, y_2, \dots, y_N\}$:

$$\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu_y)^2, \qquad \mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i$$

We want to find

$$\{y_1^o, \dots, y_N^o\} = \underset{\{y_1,\dots,y_N\}}{\arg\max}\ \sigma_y^2$$

But we are missing something: the way to do it. Via a linear projection $\mathbf{w}$, i.e. $y_i = \mathbf{w}^T \mathbf{x}_i$.

PCA (geometric interpretation of projection)

With $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}$, $\|\mathbf{w}\| = \sqrt{\mathbf{w}^T\mathbf{w}}$ and $\theta$ the angle between $\mathbf{x}$ and $\mathbf{w}$, the scalar projection of $\mathbf{x}$ onto the direction of $\mathbf{w}$ is

$$\frac{\mathbf{x}^T\mathbf{w}}{\|\mathbf{w}\|} = \|\mathbf{x}\|\cos(\theta)$$
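A quick numeric illustration of this projection identity in NumPy (the example vectors are assumed, not from the slides):

```python
import numpy as np

x = np.array([3.0, 4.0])
w = np.array([1.0, 1.0])

# cos(theta) between x and w
cos_theta = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))

# the scalar projection x.w / ||w|| equals ||x|| cos(theta)
print(np.isclose(x @ w / np.linalg.norm(w), np.linalg.norm(x) * cos_theta))  # True
```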


PCA (one dimension)

Assuming $y_i = \mathbf{w}^T\mathbf{x}_i$, the latent space is $\{y_1, \dots, y_N\} = \{\mathbf{w}^T\mathbf{x}_1, \dots, \mathbf{w}^T\mathbf{x}_N\}$ and, with $\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$,

$$\begin{aligned}
\mathbf{w}_o = \underset{\mathbf{w}}{\arg\max}\ \sigma_y^2 &= \underset{\mathbf{w}}{\arg\max}\ \frac{1}{N}\sum_{i=1}^{N}(\mathbf{w}^T\mathbf{x}_i - \mathbf{w}^T\boldsymbol{\mu})^2 \\
&= \underset{\mathbf{w}}{\arg\max}\ \frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})\big)^2 \\
&= \underset{\mathbf{w}}{\arg\max}\ \frac{1}{N}\sum_{i=1}^{N}\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\mathbf{w} \\
&= \underset{\mathbf{w}}{\arg\max}\ \mathbf{w}^T\Big[\frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\Big]\mathbf{w} = \underset{\mathbf{w}}{\arg\max}\ \mathbf{w}^T\mathbf{S}_t\mathbf{w}
\end{aligned}$$

PCA (one dimension)

$$\mathbf{w}_o = \underset{\mathbf{w}}{\arg\max}\ \sigma_y^2 = \underset{\mathbf{w}}{\arg\max}\ \mathbf{w}^T\mathbf{S}_t\mathbf{w}, \qquad \mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$$

- There is a trivial solution: the objective is unbounded as $\|\mathbf{w}\| \to \infty$.
- We can avoid it by adding an extra constraint, a fixed magnitude on $\mathbf{w}$ ($\|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w} = 1$):

$$\mathbf{w}_o = \underset{\mathbf{w}}{\arg\max}\ \mathbf{w}^T\mathbf{S}_t\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{w} = 1$$

PCA (one dimension)

Formulate the Lagrangian:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_t\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)$$

Using $\frac{\partial\,\mathbf{w}^T\mathbf{S}_t\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{S}_t\mathbf{w}$ and $\frac{\partial\,\mathbf{w}^T\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{w}$,

$$\frac{\partial L}{\partial\mathbf{w}} = \mathbf{0} \ \Rightarrow\ \mathbf{S}_t\mathbf{w} = \lambda\mathbf{w}$$

$\mathbf{w}$ is the eigenvector of $\mathbf{S}_t$ corresponding to its largest eigenvalue.
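The solution above suggests a direct implementation. The following is a minimal NumPy sketch (the helper name and the test data are assumptions, not from the slides) that forms $\mathbf{S}_t$ and takes the eigenvector of its largest eigenvalue:

```python
import numpy as np

def first_principal_direction(X):
    """X: N x F data matrix (one observation per row)."""
    mu = X.mean(axis=0)                      # sample mean
    Xc = X - mu                              # centred observations
    St = Xc.T @ Xc / X.shape[0]              # S_t = (1/N) sum (x_i - mu)(x_i - mu)^T
    eigvals, eigvecs = np.linalg.eigh(St)    # eigendecomposition (ascending eigenvalues)
    return eigvecs[:, -1], eigvals[-1]       # eigenvector of the largest eigenvalue

# Usage: the variance of the projected data equals the largest eigenvalue.
X = np.random.randn(100, 5) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
w, lam = first_principal_direction(X)
y = (X - X.mean(axis=0)) @ w                 # y_i = w^T (x_i - mu)
print(np.allclose(y.var(), lam))             # True (up to numerical error)
```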

PCA (properties of $\mathbf{S}_t$)

$$\mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T = \frac{1}{N}\mathbf{X}\mathbf{X}^T, \qquad \mathbf{X} = [\mathbf{x}_1 - \boldsymbol{\mu}, \dots, \mathbf{x}_N - \boldsymbol{\mu}]$$

- $\mathbf{S}_t$ is a symmetric matrix, hence all its eigenvalues are real.
- $\mathbf{S}_t$ is a positive semi-definite matrix, i.e. $\mathbf{w}^T\mathbf{S}_t\mathbf{w} \geq 0$ for all $\mathbf{w} \neq \mathbf{0}$ (all eigenvalues are non-negative).
- $\operatorname{rank}(\mathbf{S}_t) \leq \min(N-1, F)$.
- $\mathbf{S}_t = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$ with $\mathbf{U}^T\mathbf{U} = \mathbf{U}\mathbf{U}^T = \mathbf{I}$.
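These properties are easy to check numerically. A short illustrative check (random data assumed) in NumPy:

```python
import numpy as np

N, F = 10, 20                        # fewer samples than dimensions
X = np.random.randn(N, F)
Xc = (X - X.mean(axis=0)).T          # F x N matrix of centred observations as columns
St = Xc @ Xc.T / N                   # S_t = (1/N) X X^T

print(np.allclose(St, St.T))                         # symmetric
print((np.linalg.eigvalsh(St) > -1e-10).all())       # eigenvalues non-negative (PSD)
print(np.linalg.matrix_rank(St) <= min(N - 1, F))    # rank at most min(N-1, F)
```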

PCA (more dimensions)

How can we find a latent space with more than one dimension? We need to find a set of projections $\{\mathbf{w}_1, \dots, \mathbf{w}_d\}$:

$$\mathbf{y}_i = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{id} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1^T\mathbf{x}_i \\ \vdots \\ \mathbf{w}_d^T\mathbf{x}_i \end{bmatrix} = \mathbf{W}^T\mathbf{x}_i, \qquad \mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d]$$


PCA (more dimensions)

Maximize the variance in each dimension of the latent space:

$$\begin{aligned}
\mathbf{W}_o &= \underset{\mathbf{W}}{\arg\max}\ \frac{1}{N}\sum_{k=1}^{d}\sum_{i=1}^{N}(y_{ik} - \mu_{y_k})^2 \\
&= \underset{\mathbf{W}}{\arg\max}\ \frac{1}{N}\sum_{k=1}^{d}\sum_{i=1}^{N}\mathbf{w}_k^T(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\mathbf{w}_k \\
&= \underset{\mathbf{W}}{\arg\max}\ \sum_{k=1}^{d}\mathbf{w}_k^T\mathbf{S}_t\mathbf{w}_k = \underset{\mathbf{W}}{\arg\max}\ \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]
\end{aligned}$$

PCA (more dimensions)

$$\mathbf{W}_o = \underset{\mathbf{W}}{\arg\max}\ \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{W} = \mathbf{I}$$

Lagrangian:

$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]$$

Using $\frac{\partial\,\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]}{\partial\mathbf{W}} = 2\mathbf{S}_t\mathbf{W}$ and $\frac{\partial\,\operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]}{\partial\mathbf{W}} = 2\mathbf{W}\boldsymbol{\Lambda}$,

$$\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial\mathbf{W}} = \mathbf{0} \ \Rightarrow\ \mathbf{S}_t\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}$$

Does it ring a bell?

PCA (more dimensions)

Hence, $\mathbf{W}$ has as columns the $d$ eigenvectors of $\mathbf{S}_t$ that correspond to its $d$ largest non-zero eigenvalues, $\mathbf{W} = \mathbf{U}_d$, and

$$\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] = \operatorname{tr}[\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W}] = \operatorname{tr}[\boldsymbol{\Lambda}_d]$$

Example: let $\mathbf{U}$ be $5\times 5$ and $\mathbf{W}$ be $5\times 3$ (the first three eigenvectors). Then

$$\mathbf{W}^T\mathbf{U} = \begin{bmatrix}
\mathbf{u}_1\!\cdot\!\mathbf{u}_1 & \mathbf{u}_1\!\cdot\!\mathbf{u}_2 & \mathbf{u}_1\!\cdot\!\mathbf{u}_3 & \mathbf{u}_1\!\cdot\!\mathbf{u}_4 & \mathbf{u}_1\!\cdot\!\mathbf{u}_5 \\
\mathbf{u}_2\!\cdot\!\mathbf{u}_1 & \mathbf{u}_2\!\cdot\!\mathbf{u}_2 & \mathbf{u}_2\!\cdot\!\mathbf{u}_3 & \mathbf{u}_2\!\cdot\!\mathbf{u}_4 & \mathbf{u}_2\!\cdot\!\mathbf{u}_5 \\
\mathbf{u}_3\!\cdot\!\mathbf{u}_1 & \mathbf{u}_3\!\cdot\!\mathbf{u}_2 & \mathbf{u}_3\!\cdot\!\mathbf{u}_3 & \mathbf{u}_3\!\cdot\!\mathbf{u}_4 & \mathbf{u}_3\!\cdot\!\mathbf{u}_5
\end{bmatrix} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{bmatrix}$$

PCA (more dimensions)

$$\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda} = \begin{bmatrix}
\lambda_1 & 0 & 0 & 0 & 0 \\
0 & \lambda_2 & 0 & 0 & 0 \\
0 & 0 & \lambda_3 & 0 & 0
\end{bmatrix}
\quad\text{and}\quad
\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W} = \begin{bmatrix}
\lambda_1 & 0 & 0 \\
0 & \lambda_2 & 0 \\
0 & 0 & \lambda_3
\end{bmatrix}$$

Hence the maximum is $\operatorname{tr}[\boldsymbol{\Lambda}_d] = \sum_{i=1}^{d}\lambda_i$.
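Putting the multi-dimensional case together, here is a minimal PCA sketch in NumPy (the function name and test data are assumptions, not from the slides); the retained variance $\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]$ equals the sum of the $d$ largest eigenvalues:

```python
import numpy as np

def pca(X, d):
    """X: N x F data matrix (rows are observations); returns W = [w_1, ..., w_d], eigenvalues, mean."""
    mu = X.mean(axis=0)
    Xc = X - mu
    St = Xc.T @ Xc / X.shape[0]                  # total scatter matrix
    eigvals, eigvecs = np.linalg.eigh(St)        # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:d]          # indices of the d largest eigenvalues
    return eigvecs[:, idx], eigvals[idx], mu

X = np.random.randn(200, 10)
W, lams, mu = pca(X, d=3)
Y = (X - mu) @ W                                 # y_i = W^T (x_i - mu)

St = (X - mu).T @ (X - mu) / X.shape[0]
print(np.allclose(np.trace(W.T @ St @ W), lams.sum()))   # tr(W^T S_t W) = sum of top-d eigenvalues
```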

PCA (another perspective)

We want to find a set of bases $\mathbf{W}$ that best reconstructs the data after projection:

$$\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i, \qquad \tilde{\mathbf{x}}_i = \mathbf{W}\mathbf{W}^T\mathbf{x}_i$$

PCA (another perspective)

Let us assume for simplicity centred data (zero mean). The reconstructed data are $\tilde{\mathbf{X}} = \mathbf{W}\mathbf{Y} = \mathbf{W}\mathbf{W}^T\mathbf{X}$, and

$$\mathbf{W}_o = \underset{\mathbf{W}}{\arg\min}\ \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{F}(x_{ij} - \tilde{x}_{ij})^2 = \underset{\mathbf{W}}{\arg\min}\ \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 \qquad (1)$$

$$\text{s.t.}\ \ \mathbf{W}^T\mathbf{W} = \mathbf{I}$$

PCA (another perspective)

$$\begin{aligned}
\|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 &= \operatorname{tr}\!\big[(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})^T(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})\big] \\
&= \operatorname{tr}\!\big[\mathbf{X}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} + \mathbf{X}^T\mathbf{W}\underbrace{\mathbf{W}^T\mathbf{W}}_{\mathbf{I}}\mathbf{W}^T\mathbf{X}\big] \\
&= \underbrace{\operatorname{tr}[\mathbf{X}^T\mathbf{X}]}_{\text{constant}} - \operatorname{tr}[\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X}]
\end{aligned}$$

Hence, minimizing (1) is equivalent to $\underset{\mathbf{W}}{\max}\ \operatorname{tr}[\mathbf{W}^T\mathbf{X}\mathbf{X}^T\mathbf{W}]$.
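A small numerical check of this identity (random centred data and an arbitrary orthonormal $\mathbf{W}$ assumed):

```python
import numpy as np

F, N, d = 8, 50, 3
X = np.random.randn(F, N)                    # F x N, columns are observations
X -= X.mean(axis=1, keepdims=True)           # centre the data

_, U = np.linalg.eigh(X @ X.T)               # any orthonormal basis will do
W = U[:, -d:]                                # here: top-d eigenvectors of X X^T

recon_err = np.linalg.norm(X - W @ W.T @ X, 'fro') ** 2
identity = np.trace(X.T @ X) - np.trace(W.T @ X @ X.T @ W)
print(np.allclose(recon_err, identity))      # True: minimizing the error maximizes tr(W^T X X^T W)
```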

PCA (applications)

TEXTURE: [Figure: mean texture and variation along the first four principal components (1st PC, 2nd PC, 3rd PC, 4th PC).]

PCA (applications)

SHAPE: [Figure: mean shape and variation along the first four principal components (1st PC, 2nd PC, 3rd PC, 4th PC).]

PCA (applications)

[Figure: principal components capturing variation in identity and expression.]

Linear Discriminant Analysis

- PCA: an unsupervised approach, good for compression of data and data reconstruction. A good statistical prior.
- PCA: not explicitly defined for classification problems (i.e., the case where data come with labels).
- How do we define a latent space in this case (i.e., one that helps in data classification)?

Linear Discriminant Analysis

- We need to properly define statistical properties which may help us in classification.
- Intuition: We want to find a space in which (a) the data constituting each class look more like each other, while (b) the data of separate classes look more dissimilar.

Linear Discriminant Analysis

How do I make my data in each class look more similar? Minimize the variability in each class (minimize the variance).

[Figure: the space of $\mathbf{x}$ is mapped to a space of $y$ in which the per-class variances $\sigma_y^2(c_1)$ and $\sigma_y^2(c_2)$ are small.]

Linear Discriminant Analysis

How do I make the data between classes look dissimilar? I move the data from different classes further away from each other (increase the distance between their means).

[Figure: the space of $\mathbf{x}$ is mapped to a space of $y$ with a large squared distance $(\mu_y(c_1) - \mu_y(c_2))^2$ between the class means.]

Linear Discriminant Analysis

A bit more formally, I want a latent space $y$ such that:

- $\sigma_y^2(c_1) + \sigma_y^2(c_2)$ is minimum,
- $(\mu_y(c_1) - \mu_y(c_2))^2$ is maximum.

How do I combine them together? Minimize

$$\frac{\sigma_y^2(c_1) + \sigma_y^2(c_2)}{(\mu_y(c_1) - \mu_y(c_2))^2}$$

or, equivalently, maximize

$$\frac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)}$$

Linear Discriminant Analysis

How can I find my latent space? Via a linear projection: $y_i = \mathbf{w}^T\mathbf{x}_i$.

Linear Discriminant Analysis

With $\boldsymbol{\mu}(c_1) = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\mathbf{x}_i$,

$$\begin{aligned}
\sigma_y^2(c_1) &= \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}(y_i - \mu_y(c_1))^2 = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\big(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu}(c_1))\big)^2 \\
&= \mathbf{w}^T\Big[\frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}(\mathbf{x}_i - \boldsymbol{\mu}(c_1))(\mathbf{x}_i - \boldsymbol{\mu}(c_1))^T\Big]\mathbf{w} = \mathbf{w}^T\mathbf{S}_1\mathbf{w}
\end{aligned}$$

Similarly, $\sigma_y^2(c_2) = \mathbf{w}^T\mathbf{S}_2\mathbf{w}$.

Linear Discriminant Analysis

$$\sigma_y^2(c_1) + \sigma_y^2(c_2) = \mathbf{w}^T(\mathbf{S}_1 + \mathbf{S}_2)\mathbf{w} = \mathbf{w}^T\mathbf{S}_w\mathbf{w}, \qquad \mathbf{S}_w:\ \text{within-class scatter matrix}$$

$$(\mu_y(c_1) - \mu_y(c_2))^2 = \mathbf{w}^T(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_b\mathbf{w}, \qquad \mathbf{S}_b:\ \text{between-class scatter matrix}$$

$$\frac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)} = \frac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}$$

Linear Discriminant Analysis

$$\max_{\mathbf{w}}\ \mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{S}_w\mathbf{w} = 1$$

Lagrangian:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_b\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - 1)$$

Using $\frac{\partial\,\mathbf{w}^T\mathbf{S}_w\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{S}_w\mathbf{w}$ and $\frac{\partial\,\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{S}_b\mathbf{w}$,

$$\frac{\partial L}{\partial\mathbf{w}} = \mathbf{0} \ \Rightarrow\ \mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$$

$\mathbf{w}$ is the eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ corresponding to its largest eigenvalue; for two classes this gives, up to scale,

$$\mathbf{w} = \mathbf{S}_w^{-1}(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))$$

Linear Discriminant Analysis

Compute the LDA projection for the following 2D dataset:

$$c_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}, \qquad c_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$$

Linear Discriminant Analysis

Solution (by hand). The class statistics are

$$\boldsymbol{\mu}_1 = [3.0\ \ 3.6]^T, \qquad \boldsymbol{\mu}_2 = [8.4\ \ 7.6]^T$$

$$\mathbf{S}_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix}, \qquad \mathbf{S}_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$$

The within-class and between-class scatter matrices are

$$\mathbf{S}_w = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}, \qquad \mathbf{S}_b = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix}$$

Linear Discriminant Analysis

The LDA projection is then obtained as the solution of the generalized eigenvalue problem $\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w} = \lambda\mathbf{w}$:

$$|\mathbf{S}_w^{-1}\mathbf{S}_b - \lambda\mathbf{I}| = 0 \ \Rightarrow\ \begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \ \Rightarrow\ \lambda = 15.65$$

$$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = 15.65\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \ \Rightarrow\ \mathbf{w} = \begin{bmatrix} -0.91 \\ -0.39 \end{bmatrix}$$

Or directly by

$$\mathbf{w} = \mathbf{S}_w^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = [-0.91\ \ -0.39]^T \quad \text{(normalized to unit length)}$$
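The worked example can be reproduced in a few lines of NumPy (the code below is an illustrative check, not part of the slides):

```python
import numpy as np

c1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
c2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)      # [3.0, 3.6] and [8.4, 7.6]
S1 = (c1 - mu1).T @ (c1 - mu1) / len(c1)
S2 = (c2 - mu2).T @ (c2 - mu2) / len(c2)
Sw = S1 + S2                                     # within-class scatter
Sb = np.outer(mu1 - mu2, mu1 - mu2)              # between-class scatter

w = np.linalg.solve(Sw, mu1 - mu2)               # S_w^{-1} (mu1 - mu2)
w /= np.linalg.norm(w)
print(w)                                         # close to the slide's [-0.91, -0.39]

lam = np.linalg.eigvals(np.linalg.solve(Sw, Sb)).real.max()
print(lam)                                       # largest eigenvalue (the slide reports 15.65)
```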

LDA (Multiclass & Multidimensional case)

$$\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d]$$

LDA (Multiclass & Multidimensional case)

Within-class scatter matrix:

$$\mathbf{S}_w = \sum_{j=1}^{C}\mathbf{S}_j = \sum_{j=1}^{C}\frac{1}{N_{c_j}}\sum_{\mathbf{x}_i \in c_j}(\mathbf{x}_i - \boldsymbol{\mu}(c_j))(\mathbf{x}_i - \boldsymbol{\mu}(c_j))^T$$

Between-class scatter matrix:

$$\mathbf{S}_b = \sum_{j=1}^{C}(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})^T$$

LDA (Multiclass & Multidimensional case)

$$\max_{\mathbf{W}}\ \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{S}_w\mathbf{W} = \mathbf{I}$$

Lagrangian:

$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{S}_w\mathbf{W} - \mathbf{I})]$$

$$\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial\mathbf{W}} = 2\mathbf{S}_b\mathbf{W} - 2\mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda} = \mathbf{0} \ \Rightarrow\ \mathbf{S}_b\mathbf{W} = \mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda}$$

Hence $\mathbf{W}$ consists of the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ that correspond to the largest eigenvalues.
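A minimal multiclass LDA sketch following this derivation (the function name and synthetic data are assumptions, not from the slides):

```python
import numpy as np

def lda(X, labels, d):
    """X: N x F data matrix, labels: length-N array of class ids, d: latent dimensions."""
    mu = X.mean(axis=0)
    F = X.shape[1]
    Sw, Sb = np.zeros((F, F)), np.zeros((F, F))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)   # per-class scatter S_j
        Sb += np.outer(mu_c - mu, mu_c - mu)          # between-class scatter
    # eigenvectors of S_w^{-1} S_b with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    idx = np.argsort(eigvals.real)[::-1][:d]
    return eigvecs[:, idx].real

# Usage: at most C-1 directions are informative, since rank(S_b) <= C-1.
X = np.vstack([np.random.randn(30, 4) + m
               for m in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0])])
labels = np.repeat([0, 1, 2], 30)
W = lda(X, labels, d=2)
Y = X @ W                                             # projected data
```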
