Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Transcription

Deterministic Component Analysis

Goal (Lecture): To present standard and modern Component Analysis (CA) techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Graph/Neighbourhood-based Component Analysis.

Goal (Tutorials): To provide the students with the necessary mathematical tools for deeply understanding the CA techniques.

Materials

- Pattern Recognition & Machine Learning by C. Bishop, Chapter 12.
- Turk, Matthew, and Alex Pentland. "Eigenfaces for recognition." Journal of Cognitive Neuroscience 3.1 (1991): 71-86.
- Belhumeur, Peter N., João P. Hespanha, and David J. Kriegman. "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection." IEEE Transactions on Pattern Analysis and Machine Intelligence 19.7 (1997): 711-720.
- He, Xiaofei, et al. "Face recognition using Laplacianfaces." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.3 (2005): 328-340.
- Belkin, Mikhail, and Partha Niyogi. "Laplacian eigenmaps for dimensionality reduction and data representation." Neural Computation 15.6 (2003): 1373-1396.

Deterministic Component Analysis

Problem: Given a population of data $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^F$ (i.e. observations), find a latent space $\mathbf{y}_1, \dots, \mathbf{y}_N \in \mathbb{R}^d$ (usually $d \ll F$) which is relevant to a task.

[Figure: each observation $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$ is mapped to a latent representation $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N$.]

Latent Variable Models

- Linear: the latent space is obtained from the observations through a linear mapping (e.g. $\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i$), with the projection matrix as parameters, $d \ll F$.
- Non-linear: the latent space is obtained through a non-linear mapping of the observations.

What to have in mind?

- What are the properties of my latent space?
- How do I find it (linear, non-linear)?
- Which is the cost function?
- How do I solve the problem?

A first example

- Let's assume that we want to find a descriptive latent space, i.e. one that best describes my population as a whole.
- How do I define it mathematically?
- Idea! This is a statistical machine learning course, isn't it? Hence, I will try to preserve global statistical properties.

A first example

- What are the data statistics that can be used to describe the variability of my observations?
- One such statistic is the variance of the population (how much the data deviate around a mean).
- Attempt: I want to find a low-dimensional latent space where the "majority" of the variance is preserved (or, in other words, maximized).

PCA (one dimension)

Variance of the latent space $\{y_1, y_2, \dots, y_N\}$:

$$\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu_y)^2, \qquad \mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i$$

We want to find

$$\{y_1^o, \dots, y_N^o\} = \underset{\{y_1,\dots,y_N\}}{\arg\max}\ \sigma_y^2$$

But we are missing something: the way to do it. Via a linear projection $\mathbf{w}$, i.e. $y_i = \mathbf{w}^T \mathbf{x}_i$.

PCA (geometric interpretation of projection)

With $\|\mathbf{x}\| = \sqrt{\mathbf{x}^T\mathbf{x}}$, $\|\mathbf{w}\| = \sqrt{\mathbf{w}^T\mathbf{w}}$ and $\theta$ the angle between $\mathbf{x}$ and $\mathbf{w}$, the scalar projection of $\mathbf{x}$ onto the direction of $\mathbf{w}$ is

$$\frac{\mathbf{x}^T\mathbf{w}}{\|\mathbf{w}\|} = \|\mathbf{x}\|\cos(\theta)$$
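A quick numeric illustration of this projection identity in NumPy (the example vectors are assumed, not from the slides):

```python
import numpy as np

x = np.array([3.0, 4.0])
w = np.array([1.0, 1.0])

# cos(theta) between x and w
cos_theta = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))

# the scalar projection x.w / ||w|| equals ||x|| cos(theta)
print(np.isclose(x @ w / np.linalg.norm(w), np.linalg.norm(x) * cos_theta))  # True
```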


PCA (one dimension)

Assuming $y_i = \mathbf{w}^T\mathbf{x}_i$, the latent space is $\{y_1, \dots, y_N\} = \{\mathbf{w}^T\mathbf{x}_1, \dots, \mathbf{w}^T\mathbf{x}_N\}$ and, with $\boldsymbol{\mu} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$,

$$\begin{aligned}
\mathbf{w}_o = \underset{\mathbf{w}}{\arg\max}\ \sigma_y^2 &= \underset{\mathbf{w}}{\arg\max}\ \frac{1}{N}\sum_{i=1}^{N}(\mathbf{w}^T\mathbf{x}_i - \mathbf{w}^T\boldsymbol{\mu})^2 \\
&= \underset{\mathbf{w}}{\arg\max}\ \frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})\big)^2 \\
&= \underset{\mathbf{w}}{\arg\max}\ \frac{1}{N}\sum_{i=1}^{N}\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\mathbf{w} \\
&= \underset{\mathbf{w}}{\arg\max}\ \mathbf{w}^T\Big[\frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\Big]\mathbf{w} = \underset{\mathbf{w}}{\arg\max}\ \mathbf{w}^T\mathbf{S}_t\mathbf{w}
\end{aligned}$$

PCA (one dimension)

$$\mathbf{w}_o = \underset{\mathbf{w}}{\arg\max}\ \sigma_y^2 = \underset{\mathbf{w}}{\arg\max}\ \mathbf{w}^T\mathbf{S}_t\mathbf{w}, \qquad \mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$$

- There is a trivial solution: the objective is unbounded as $\|\mathbf{w}\| \to \infty$.
- We can avoid it by adding an extra constraint, a fixed magnitude on $\mathbf{w}$ ($\|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w} = 1$):

$$\mathbf{w}_o = \underset{\mathbf{w}}{\arg\max}\ \mathbf{w}^T\mathbf{S}_t\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{w} = 1$$

PCA (one dimension)

Formulate the Lagrangian:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_t\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)$$

Using $\frac{\partial\,\mathbf{w}^T\mathbf{S}_t\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{S}_t\mathbf{w}$ and $\frac{\partial\,\mathbf{w}^T\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{w}$,

$$\frac{\partial L}{\partial\mathbf{w}} = \mathbf{0} \ \Rightarrow\ \mathbf{S}_t\mathbf{w} = \lambda\mathbf{w}$$

$\mathbf{w}$ is the eigenvector of $\mathbf{S}_t$ corresponding to its largest eigenvalue.
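The solution above suggests a direct implementation. The following is a minimal NumPy sketch (the helper name and the test data are assumptions, not from the slides) that forms $\mathbf{S}_t$ and takes the eigenvector of its largest eigenvalue:

```python
import numpy as np

def first_principal_direction(X):
    """X: N x F data matrix (one observation per row)."""
    mu = X.mean(axis=0)                      # sample mean
    Xc = X - mu                              # centred observations
    St = Xc.T @ Xc / X.shape[0]              # S_t = (1/N) sum (x_i - mu)(x_i - mu)^T
    eigvals, eigvecs = np.linalg.eigh(St)    # eigendecomposition (ascending eigenvalues)
    return eigvecs[:, -1], eigvals[-1]       # eigenvector of the largest eigenvalue

# Usage: the variance of the projected data equals the largest eigenvalue.
X = np.random.randn(100, 5) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
w, lam = first_principal_direction(X)
y = (X - X.mean(axis=0)) @ w                 # y_i = w^T (x_i - mu)
print(np.allclose(y.var(), lam))             # True (up to numerical error)
```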

PCA (properties of $\mathbf{S}_t$)

$$\mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T = \frac{1}{N}\mathbf{X}\mathbf{X}^T, \qquad \mathbf{X} = [\mathbf{x}_1 - \boldsymbol{\mu}, \dots, \mathbf{x}_N - \boldsymbol{\mu}]$$

- $\mathbf{S}_t$ is a symmetric matrix, hence all its eigenvalues are real.
- $\mathbf{S}_t$ is a positive semi-definite matrix, i.e. $\mathbf{w}^T\mathbf{S}_t\mathbf{w} \geq 0$ for all $\mathbf{w} \neq \mathbf{0}$ (all eigenvalues are non-negative).
- $\operatorname{rank}(\mathbf{S}_t) \leq \min(N-1, F)$.
- $\mathbf{S}_t = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$ with $\mathbf{U}^T\mathbf{U} = \mathbf{U}\mathbf{U}^T = \mathbf{I}$.
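These properties are easy to check numerically. A short illustrative check (random data assumed) in NumPy:

```python
import numpy as np

N, F = 10, 20                        # fewer samples than dimensions
X = np.random.randn(N, F)
Xc = (X - X.mean(axis=0)).T          # F x N matrix of centred observations as columns
St = Xc @ Xc.T / N                   # S_t = (1/N) X X^T

print(np.allclose(St, St.T))                         # symmetric
print((np.linalg.eigvalsh(St) > -1e-10).all())       # eigenvalues non-negative (PSD)
print(np.linalg.matrix_rank(St) <= min(N - 1, F))    # rank at most min(N-1, F)
```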

PCA (more dimensions)

How can we find a latent space with more than one dimension? We need to find a set of projections $\{\mathbf{w}_1, \dots, \mathbf{w}_d\}$:

$$\mathbf{y}_i = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{id} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_1^T\mathbf{x}_i \\ \vdots \\ \mathbf{w}_d^T\mathbf{x}_i \end{bmatrix} = \mathbf{W}^T\mathbf{x}_i, \qquad \mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d]$$


PCA (more dimensions)

Maximize the variance in each dimension of the latent space:

$$\begin{aligned}
\mathbf{W}_o &= \underset{\mathbf{W}}{\arg\max}\ \frac{1}{N}\sum_{k=1}^{d}\sum_{i=1}^{N}(y_{ik} - \mu_{y_k})^2 \\
&= \underset{\mathbf{W}}{\arg\max}\ \frac{1}{N}\sum_{k=1}^{d}\sum_{i=1}^{N}\mathbf{w}_k^T(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\mathbf{w}_k \\
&= \underset{\mathbf{W}}{\arg\max}\ \sum_{k=1}^{d}\mathbf{w}_k^T\mathbf{S}_t\mathbf{w}_k = \underset{\mathbf{W}}{\arg\max}\ \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]
\end{aligned}$$

PCA (more dimensions)

$$\mathbf{W}_o = \underset{\mathbf{W}}{\arg\max}\ \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{W} = \mathbf{I}$$

Lagrangian:

$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]$$

Using $\frac{\partial\,\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]}{\partial\mathbf{W}} = 2\mathbf{S}_t\mathbf{W}$ and $\frac{\partial\,\operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]}{\partial\mathbf{W}} = 2\mathbf{W}\boldsymbol{\Lambda}$,

$$\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial\mathbf{W}} = \mathbf{0} \ \Rightarrow\ \mathbf{S}_t\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}$$

Does it ring a bell?

PCA (more dimensions)

Hence, $\mathbf{W}$ has as columns the $d$ eigenvectors of $\mathbf{S}_t$ that correspond to its $d$ largest non-zero eigenvalues, $\mathbf{W} = \mathbf{U}_d$, and

$$\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] = \operatorname{tr}[\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W}] = \operatorname{tr}[\boldsymbol{\Lambda}_d]$$

Example: let $\mathbf{U}$ be $5\times 5$ and $\mathbf{W}$ be $5\times 3$ (the first three eigenvectors). Then

$$\mathbf{W}^T\mathbf{U} = \begin{bmatrix}
\mathbf{u}_1\!\cdot\!\mathbf{u}_1 & \mathbf{u}_1\!\cdot\!\mathbf{u}_2 & \mathbf{u}_1\!\cdot\!\mathbf{u}_3 & \mathbf{u}_1\!\cdot\!\mathbf{u}_4 & \mathbf{u}_1\!\cdot\!\mathbf{u}_5 \\
\mathbf{u}_2\!\cdot\!\mathbf{u}_1 & \mathbf{u}_2\!\cdot\!\mathbf{u}_2 & \mathbf{u}_2\!\cdot\!\mathbf{u}_3 & \mathbf{u}_2\!\cdot\!\mathbf{u}_4 & \mathbf{u}_2\!\cdot\!\mathbf{u}_5 \\
\mathbf{u}_3\!\cdot\!\mathbf{u}_1 & \mathbf{u}_3\!\cdot\!\mathbf{u}_2 & \mathbf{u}_3\!\cdot\!\mathbf{u}_3 & \mathbf{u}_3\!\cdot\!\mathbf{u}_4 & \mathbf{u}_3\!\cdot\!\mathbf{u}_5
\end{bmatrix} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{bmatrix}$$

PCA (more dimensions)

$$\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda} = \begin{bmatrix}
\lambda_1 & 0 & 0 & 0 & 0 \\
0 & \lambda_2 & 0 & 0 & 0 \\
0 & 0 & \lambda_3 & 0 & 0
\end{bmatrix}
\quad\text{and}\quad
\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W} = \begin{bmatrix}
\lambda_1 & 0 & 0 \\
0 & \lambda_2 & 0 \\
0 & 0 & \lambda_3
\end{bmatrix}$$

Hence the maximum is $\operatorname{tr}[\boldsymbol{\Lambda}_d] = \sum_{i=1}^{d}\lambda_i$.
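Putting the multi-dimensional case together, here is a minimal PCA sketch in NumPy (the function name and test data are assumptions, not from the slides); the retained variance $\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]$ equals the sum of the $d$ largest eigenvalues:

```python
import numpy as np

def pca(X, d):
    """X: N x F data matrix (rows are observations); returns W = [w_1, ..., w_d], eigenvalues, mean."""
    mu = X.mean(axis=0)
    Xc = X - mu
    St = Xc.T @ Xc / X.shape[0]                  # total scatter matrix
    eigvals, eigvecs = np.linalg.eigh(St)        # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:d]          # indices of the d largest eigenvalues
    return eigvecs[:, idx], eigvals[idx], mu

X = np.random.randn(200, 10)
W, lams, mu = pca(X, d=3)
Y = (X - mu) @ W                                 # y_i = W^T (x_i - mu)

St = (X - mu).T @ (X - mu) / X.shape[0]
print(np.allclose(np.trace(W.T @ St @ W), lams.sum()))   # tr(W^T S_t W) = sum of top-d eigenvalues
```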

PCA (another perspective)

We want to find a set of bases $\mathbf{W}$ that best reconstructs the data after projection:

$$\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i, \qquad \tilde{\mathbf{x}}_i = \mathbf{W}\mathbf{W}^T\mathbf{x}_i$$

PCA (another perspective)

Let us assume for simplicity centred data (zero mean). The reconstructed data are $\tilde{\mathbf{X}} = \mathbf{W}\mathbf{Y} = \mathbf{W}\mathbf{W}^T\mathbf{X}$, and

$$\mathbf{W}_o = \underset{\mathbf{W}}{\arg\min}\ \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{F}(x_{ij} - \tilde{x}_{ij})^2 = \underset{\mathbf{W}}{\arg\min}\ \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 \qquad (1)$$

$$\text{s.t.}\ \ \mathbf{W}^T\mathbf{W} = \mathbf{I}$$

PCA (another perspective)

$$\begin{aligned}
\|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 &= \operatorname{tr}\!\big[(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})^T(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})\big] \\
&= \operatorname{tr}\!\big[\mathbf{X}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} - \mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X} + \mathbf{X}^T\mathbf{W}\underbrace{\mathbf{W}^T\mathbf{W}}_{\mathbf{I}}\mathbf{W}^T\mathbf{X}\big] \\
&= \underbrace{\operatorname{tr}[\mathbf{X}^T\mathbf{X}]}_{\text{constant}} - \operatorname{tr}[\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X}]
\end{aligned}$$

Hence, minimizing (1) is equivalent to $\underset{\mathbf{W}}{\max}\ \operatorname{tr}[\mathbf{W}^T\mathbf{X}\mathbf{X}^T\mathbf{W}]$.
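A small numerical check of this identity (random centred data and an arbitrary orthonormal $\mathbf{W}$ assumed):

```python
import numpy as np

F, N, d = 8, 50, 3
X = np.random.randn(F, N)                    # F x N, columns are observations
X -= X.mean(axis=1, keepdims=True)           # centre the data

_, U = np.linalg.eigh(X @ X.T)               # any orthonormal basis will do
W = U[:, -d:]                                # here: top-d eigenvectors of X X^T

recon_err = np.linalg.norm(X - W @ W.T @ X, 'fro') ** 2
identity = np.trace(X.T @ X) - np.trace(W.T @ X @ X.T @ W)
print(np.allclose(recon_err, identity))      # True: minimizing the error maximizes tr(W^T X X^T W)
```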

PCA (applications)

TEXTURE: [Figure: mean texture and variation along the first four principal components (1st PC, 2nd PC, 3rd PC, 4th PC).]

PCA (applications)

SHAPE: [Figure: mean shape and variation along the first four principal components (1st PC, 2nd PC, 3rd PC, 4th PC).]

PCA (applications)

[Figure: principal components capturing variation in identity and expression.]

Linear Discriminant Analysis

- PCA: an unsupervised approach, good for compression of data and data reconstruction. A good statistical prior.
- PCA: not explicitly defined for classification problems (i.e., the case where data come with labels).
- How do we define a latent space in this case (i.e., one that helps in data classification)?

Linear Discriminant Analysis

- We need to properly define statistical properties which may help us in classification.
- Intuition: We want to find a space in which (a) the data constituting each class look more like each other, while (b) the data of separate classes look more dissimilar.

Linear Discriminant Analysis

How do I make my data in each class look more similar? Minimize the variability in each class (minimize the variance).

[Figure: the space of $\mathbf{x}$ is mapped to a space of $y$ in which the per-class variances $\sigma_y^2(c_1)$ and $\sigma_y^2(c_2)$ are small.]

Linear Discriminant Analysis

How do I make the data between classes look dissimilar? I move the data from different classes further away from each other (increase the distance between their means).

[Figure: the space of $\mathbf{x}$ is mapped to a space of $y$ with a large squared distance $(\mu_y(c_1) - \mu_y(c_2))^2$ between the class means.]

Linear Discriminant Analysis

A bit more formally, I want a latent space $y$ such that:

- $\sigma_y^2(c_1) + \sigma_y^2(c_2)$ is minimum,
- $(\mu_y(c_1) - \mu_y(c_2))^2$ is maximum.

How do I combine them together? Minimize

$$\frac{\sigma_y^2(c_1) + \sigma_y^2(c_2)}{(\mu_y(c_1) - \mu_y(c_2))^2}$$

or, equivalently, maximize

$$\frac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)}$$

Linear Discriminant Analysis

How can I find my latent space? Via a linear projection: $y_i = \mathbf{w}^T\mathbf{x}_i$.

Linear Discriminant Analysis

With $\boldsymbol{\mu}(c_1) = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\mathbf{x}_i$,

$$\begin{aligned}
\sigma_y^2(c_1) &= \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}(y_i - \mu_y(c_1))^2 = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\big(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu}(c_1))\big)^2 \\
&= \mathbf{w}^T\Big[\frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}(\mathbf{x}_i - \boldsymbol{\mu}(c_1))(\mathbf{x}_i - \boldsymbol{\mu}(c_1))^T\Big]\mathbf{w} = \mathbf{w}^T\mathbf{S}_1\mathbf{w}
\end{aligned}$$

Similarly, $\sigma_y^2(c_2) = \mathbf{w}^T\mathbf{S}_2\mathbf{w}$.

Linear Discriminant Analysis

$$\sigma_y^2(c_1) + \sigma_y^2(c_2) = \mathbf{w}^T(\mathbf{S}_1 + \mathbf{S}_2)\mathbf{w} = \mathbf{w}^T\mathbf{S}_w\mathbf{w}, \qquad \mathbf{S}_w:\ \text{within-class scatter matrix}$$

$$(\mu_y(c_1) - \mu_y(c_2))^2 = \mathbf{w}^T(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_b\mathbf{w}, \qquad \mathbf{S}_b:\ \text{between-class scatter matrix}$$

$$\frac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)} = \frac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}$$

Linear Discriminant Analysis

$$\max_{\mathbf{w}}\ \mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{S}_w\mathbf{w} = 1$$

Lagrangian:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_b\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - 1)$$

Using $\frac{\partial\,\mathbf{w}^T\mathbf{S}_w\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{S}_w\mathbf{w}$ and $\frac{\partial\,\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\partial\mathbf{w}} = 2\mathbf{S}_b\mathbf{w}$,

$$\frac{\partial L}{\partial\mathbf{w}} = \mathbf{0} \ \Rightarrow\ \mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$$

$\mathbf{w}$ is the eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ corresponding to its largest eigenvalue; for two classes this gives, up to scale,

$$\mathbf{w} = \mathbf{S}_w^{-1}(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))$$

Linear Discriminant Analysis

Compute the LDA projection for the following 2D dataset:

$$c_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}, \qquad c_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$$

Linear Discriminant Analysis

Solution (by hand). The class statistics are

$$\boldsymbol{\mu}_1 = [3.0\ \ 3.6]^T, \qquad \boldsymbol{\mu}_2 = [8.4\ \ 7.6]^T$$

$$\mathbf{S}_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix}, \qquad \mathbf{S}_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$$

The within-class and between-class scatter matrices are

$$\mathbf{S}_w = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}, \qquad \mathbf{S}_b = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix}$$

Linear Discriminant Analysis

The LDA projection is then obtained as the solution of the generalized eigenvalue problem $\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w} = \lambda\mathbf{w}$:

$$|\mathbf{S}_w^{-1}\mathbf{S}_b - \lambda\mathbf{I}| = 0 \ \Rightarrow\ \begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \ \Rightarrow\ \lambda = 15.65$$

$$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = 15.65\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \ \Rightarrow\ \mathbf{w} = \begin{bmatrix} -0.91 \\ -0.39 \end{bmatrix}$$

Or directly by

$$\mathbf{w} = \mathbf{S}_w^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = [-0.91\ \ -0.39]^T \quad \text{(normalized to unit length)}$$
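The worked example can be reproduced in a few lines of NumPy (the code below is an illustrative check, not part of the slides):

```python
import numpy as np

c1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
c2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)      # [3.0, 3.6] and [8.4, 7.6]
S1 = (c1 - mu1).T @ (c1 - mu1) / len(c1)
S2 = (c2 - mu2).T @ (c2 - mu2) / len(c2)
Sw = S1 + S2                                     # within-class scatter
Sb = np.outer(mu1 - mu2, mu1 - mu2)              # between-class scatter

w = np.linalg.solve(Sw, mu1 - mu2)               # S_w^{-1} (mu1 - mu2)
w /= np.linalg.norm(w)
print(w)                                         # close to the slide's [-0.91, -0.39]

lam = np.linalg.eigvals(np.linalg.solve(Sw, Sb)).real.max()
print(lam)                                       # largest eigenvalue (the slide reports 15.65)
```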

LDA (Multiclass & Multidimensional case)

$$\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d]$$

LDA (Multiclass & Multidimensional case)

Within-class scatter matrix:

$$\mathbf{S}_w = \sum_{j=1}^{C}\mathbf{S}_j = \sum_{j=1}^{C}\frac{1}{N_{c_j}}\sum_{\mathbf{x}_i \in c_j}(\mathbf{x}_i - \boldsymbol{\mu}(c_j))(\mathbf{x}_i - \boldsymbol{\mu}(c_j))^T$$

Between-class scatter matrix:

$$\mathbf{S}_b = \sum_{j=1}^{C}(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})(\boldsymbol{\mu}(c_j) - \boldsymbol{\mu})^T$$

LDA (Multiclass & Multidimensional case)

$$\max_{\mathbf{W}}\ \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{S}_w\mathbf{W} = \mathbf{I}$$

Lagrangian:

$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{S}_w\mathbf{W} - \mathbf{I})]$$

$$\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial\mathbf{W}} = 2\mathbf{S}_b\mathbf{W} - 2\mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda} = \mathbf{0} \ \Rightarrow\ \mathbf{S}_b\mathbf{W} = \mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda}$$

Hence $\mathbf{W}$ consists of the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ that correspond to the largest eigenvalues.
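A minimal multiclass LDA sketch following this derivation (the function name and synthetic data are assumptions, not from the slides):

```python
import numpy as np

def lda(X, labels, d):
    """X: N x F data matrix, labels: length-N array of class ids, d: latent dimensions."""
    mu = X.mean(axis=0)
    F = X.shape[1]
    Sw, Sb = np.zeros((F, F)), np.zeros((F, F))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)   # per-class scatter S_j
        Sb += np.outer(mu_c - mu, mu_c - mu)          # between-class scatter
    # eigenvectors of S_w^{-1} S_b with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    idx = np.argsort(eigvals.real)[::-1][:d]
    return eigvecs[:, idx].real

# Usage: at most C-1 directions are informative, since rank(S_b) <= C-1.
X = np.vstack([np.random.randn(30, 4) + m
               for m in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0])])
labels = np.repeat([0, 1, 2], 30)
W = lda(X, labels, d=2)
Y = X @ W                                             # projected data
```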
