Transcription
Course 495: Advanced Statistical Machine Learning / Pattern Recognition
Deterministic Component Analysis

Goal (Lecture): To present standard and modern Component Analysis (CA) techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and graph/neighbourhood-based Component Analysis.

Goal (Tutorials): To provide the students with the necessary mathematical tools for deeply understanding the CA techniques.

Stefanos Zafeiriou, Adv. Statistical Machine Learning (course 495)
Materials
- Pattern Recognition & Machine Learning by C. Bishop, Chapter 12.
- Turk, Matthew, and Alex Pentland. "Eigenfaces for recognition." Journal of Cognitive Neuroscience 3.1 (1991): 71-86.
- Belhumeur, Peter N., João P. Hespanha, and David J. Kriegman. "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection." IEEE Transactions on Pattern Analysis and Machine Intelligence 19.7 (1997): 711-720.
- He, Xiaofei, et al. "Face recognition using Laplacianfaces." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.3 (2005): 328-340.
- Belkin, Mikhail, and Partha Niyogi. "Laplacian eigenmaps for dimensionality reduction and data representation." Neural Computation 15.6 (2003): 1373-1396.
Deterministic Component Analysis
Problem: Given a population of data $\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^F$ (i.e., observations), find a latent space $\mathbf{y}_1, \dots, \mathbf{y}_N \in \mathbb{R}^d$ (usually $d \ll F$) which is relevant to a task.
[Figure: observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ mapped to latent vectors $\mathbf{y}_1, \dots, \mathbf{y}_N$.]
Latent Variable Models
[Figure: linear and non-linear latent variable models mapping observations in $\mathbb{R}^F$ to latent variables in $\mathbb{R}^d$; the parameters of the mapping are learned.]
What to have in mind?
- What are the properties of my latent space?
- How do I find it (linear, non-linear)?
- Which is the cost function?
- How do I solve the problem?
A first example
- Let's assume that we want to find a descriptive latent space, i.e., one that best describes my population as a whole.
- How do I define it mathematically?
- Idea! This is a statistical machine learning course, isn't it? Hence, I will try to preserve global statistical properties.
A first example
- What are the data statistics that can be used to describe the variability of my observations?
- One such statistic is the variance of the population (how much the data deviate around a mean).
- Attempt: I want to find a low-dimensional latent space where the "majority" of the variance is preserved (or, in other words, maximized).
PCA (one dimension)
Variance of the latent space $\{y_1, y_2, \dots, y_N\}$:
$$\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu_y)^2, \qquad \mu_y = \frac{1}{N}\sum_{i=1}^{N} y_i$$
We want to find
$$\{y_1^*, \dots, y_N^*\} = \arg\max_{\{y_1, \dots, y_N\}} \sigma_y^2$$
But we are missing something: the way to do it. Via a linear projection $\mathbf{w}$, i.e., $y_i = \mathbf{w}^T\mathbf{x}_i$.
PCA (geometric interpretation of projection)
$$\mathbf{x}^T\mathbf{w} = \|\mathbf{x}\|\,\|\mathbf{w}\|\cos(\theta), \qquad \frac{\mathbf{x}^T\mathbf{w}}{\|\mathbf{w}\|} = \|\mathbf{x}\|\cos(\theta)$$
That is, for a unit-norm $\mathbf{w}$, $\mathbf{x}^T\mathbf{w}$ is the (signed) length of the projection of $\mathbf{x}$ onto the direction of $\mathbf{w}$.
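A quick numerical illustration of the formula above (the vectors x and w here are arbitrary toy values, not from the slides):

```python
import numpy as np

# Scalar projection of x onto the direction of w: x^T w / ||w|| = ||x|| cos(theta).
x = np.array([3.0, 4.0])
w = np.array([2.0, 0.0])

proj_len = x @ w / np.linalg.norm(w)
cos_theta = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))

# Both quantities agree: 3.0
print(proj_len, np.linalg.norm(x) * cos_theta)
```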
PCA (one dimension)
[Figure]
PCA (one dimension)
Assuming $y_i = \mathbf{w}^T\mathbf{x}_i$, the latent space is $\{y_1, \dots, y_N\} = \{\mathbf{w}^T\mathbf{x}_1, \dots, \mathbf{w}^T\mathbf{x}_N\}$.
$$\begin{aligned}
\mathbf{w}_0 &= \arg\max_{\mathbf{w}} \sigma_y^2 = \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}(\mathbf{w}^T\mathbf{x}_i - \mathbf{w}^T\boldsymbol{\mu})^2 \\
&= \arg\max_{\mathbf{w}} \frac{1}{N}\sum_{i=1}^{N}\left(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu})\right)^2 \\
&= \arg\max_{\mathbf{w}} \mathbf{w}^T\left[\frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\right]\mathbf{w} \\
&= \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w}
\end{aligned}$$
PCA (one dimension)
$$\mathbf{w}_0 = \arg\max_{\mathbf{w}} \sigma_y^2 = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w}, \qquad \mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T$$
There is a trivial solution of $\mathbf{w}$ ($\|\mathbf{w}\| \to \infty$). We can avoid it by adding an extra constraint: a fixed magnitude on $\mathbf{w}$ ($\|\mathbf{w}\|^2 = \mathbf{w}^T\mathbf{w} = 1$).
$$\mathbf{w}_0 = \arg\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_t\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{w} = 1$$
PCA (one dimension)
Formulate the Lagrangian:
$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_t\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)$$
$$\frac{\partial L(\mathbf{w}, \lambda)}{\partial \mathbf{w}} = 2\mathbf{S}_t\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0} \;\Rightarrow\; \mathbf{S}_t\mathbf{w} = \lambda\mathbf{w}$$
$\mathbf{w}$ is the eigenvector of $\mathbf{S}_t$ corresponding to the largest eigenvalue.
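A minimal numpy sketch of this one-dimensional PCA (the random toy data are my own; the slides specify none):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))              # N=100 samples in R^F (F=2) as columns

mu = X.mean(axis=1, keepdims=True)
St = (X - mu) @ (X - mu).T / X.shape[1]    # total scatter S_t

evals, evecs = np.linalg.eigh(St)          # S_t is symmetric, so eigh applies
w = evecs[:, -1]                           # eigenvector of the largest eigenvalue

y = w @ (X - mu)                           # one-dimensional latent space
print(np.isclose(y.var(), evals[-1]))      # achieved variance equals lambda_max
```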
PCA
Properties of $\mathbf{S}_t$:
$$\mathbf{S}_t = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T = \frac{1}{N}\mathbf{X}\mathbf{X}^T, \qquad \mathbf{X} = [\mathbf{x}_1 - \boldsymbol{\mu}, \dots, \mathbf{x}_N - \boldsymbol{\mu}]$$
- $\mathbf{S}_t$ is a symmetric matrix, hence all eigenvalues are real.
- $\mathbf{S}_t$ is a positive semi-definite matrix, i.e., $\forall \mathbf{w}: \mathbf{w}^T\mathbf{S}_t\mathbf{w} \geq 0$ (all eigenvalues are non-negative).
- $\operatorname{rank}(\mathbf{S}_t) \leq \min(N-1, F)$.
- Eigendecomposition: $\mathbf{S}_t = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, with $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ and $\mathbf{U}\mathbf{U}^T = \mathbf{I}$.
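These properties are easy to check numerically; a sketch with toy data (my own, chosen with F > N so the rank bound is visible):

```python
import numpy as np

rng = np.random.default_rng(1)
F, N = 10, 5                                     # more features than samples
X = rng.normal(size=(F, N))
Xc = X - X.mean(axis=1, keepdims=True)
St = Xc @ Xc.T / N

print(np.allclose(St, St.T))                     # symmetric: True
print(np.all(np.linalg.eigvalsh(St) >= -1e-10))  # non-negative eigenvalues: True
print(np.linalg.matrix_rank(St), min(N - 1, F))  # rank hits the bound: 4 4
```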
PCA (more dimensions)
- How can we find a latent space with more than one dimension?
- We need to find a set of projections $\{\mathbf{w}_1, \dots, \mathbf{w}_d\}$:
$$\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i, \qquad \mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d], \qquad \text{i.e., } y_{ik} = \mathbf{w}_k^T\mathbf{x}_i$$
PCA (more dimensions)
[Figure]
PCA (more dimensions)
Maximize the variance in each dimension:
$$\begin{aligned}
\mathbf{W}_0 &= \arg\max_{\mathbf{W}} \sum_{k=1}^{d}\frac{1}{N}\sum_{i=1}^{N}(y_{ik} - \mu_{y_k})^2 \\
&= \arg\max_{\mathbf{W}} \sum_{k=1}^{d}\mathbf{w}_k^T\left[\frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^T\right]\mathbf{w}_k \\
&= \arg\max_{\mathbf{W}} \sum_{k=1}^{d}\mathbf{w}_k^T\mathbf{S}_t\mathbf{w}_k = \arg\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}]
\end{aligned}$$
PCA (more dimensions)
$$\mathbf{W}_0 = \arg\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{W} = \mathbf{I}$$
Lagrangian:
$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{W} - \mathbf{I})]$$
$$\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial \mathbf{W}} = 2\mathbf{S}_t\mathbf{W} - 2\mathbf{W}\boldsymbol{\Lambda} = \mathbf{0} \;\Rightarrow\; \mathbf{S}_t\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}$$
Does it ring a bell?
PCA (more dimensions)
Hence, $\mathbf{W}$ has as columns the $d$ eigenvectors of $\mathbf{S}_t$ that correspond to its $d$ largest nonzero eigenvalues: $\mathbf{W} = \mathbf{U}_d$, so
$$\operatorname{tr}[\mathbf{W}^T\mathbf{S}_t\mathbf{W}] = \operatorname{tr}[\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W}] = \operatorname{tr}[\boldsymbol{\Lambda}_d]$$
Example: let $\mathbf{U}$ be $5\times 5$ and $\mathbf{W}$ be $5\times 3$ (the first three eigenvectors). Then
$$\mathbf{W}^T\mathbf{U} = \begin{bmatrix} \mathbf{u}_1\!\cdot\!\mathbf{u}_1 & \mathbf{u}_1\!\cdot\!\mathbf{u}_2 & \mathbf{u}_1\!\cdot\!\mathbf{u}_3 & \mathbf{u}_1\!\cdot\!\mathbf{u}_4 & \mathbf{u}_1\!\cdot\!\mathbf{u}_5 \\ \mathbf{u}_2\!\cdot\!\mathbf{u}_1 & \mathbf{u}_2\!\cdot\!\mathbf{u}_2 & \mathbf{u}_2\!\cdot\!\mathbf{u}_3 & \mathbf{u}_2\!\cdot\!\mathbf{u}_4 & \mathbf{u}_2\!\cdot\!\mathbf{u}_5 \\ \mathbf{u}_3\!\cdot\!\mathbf{u}_1 & \mathbf{u}_3\!\cdot\!\mathbf{u}_2 & \mathbf{u}_3\!\cdot\!\mathbf{u}_3 & \mathbf{u}_3\!\cdot\!\mathbf{u}_4 & \mathbf{u}_3\!\cdot\!\mathbf{u}_5 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}$$
PCA (more dimensions)
$$\mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & 0 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 & 0 \\ 0 & 0 & \lambda_3 & 0 & 0 \end{bmatrix} \quad\text{and}\quad \mathbf{W}^T\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T\mathbf{W} = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}$$
Hence the maximum is
$$\operatorname{tr}[\boldsymbol{\Lambda}_d] = \sum_{i=1}^{d}\lambda_i$$
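A sketch verifying that the top-d eigenvectors attain this maximum (toy data of my own; numpy's eigh returns eigenvalues in ascending order, hence the indexing from the end):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 200))
Xc = X - X.mean(axis=1, keepdims=True)
St = Xc @ Xc.T / X.shape[1]

evals, U = np.linalg.eigh(St)              # ascending eigenvalues
d = 3
W = U[:, -d:]                              # d eigenvectors of largest eigenvalues

# tr[W^T S_t W] equals the sum of the d largest eigenvalues of S_t
print(np.isclose(np.trace(W.T @ St @ W), evals[-d:].sum()))
```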
PCA (another perspective)
We want to find a set of bases $\mathbf{W}$ that best reconstructs the data after projection:
$$\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i, \qquad \tilde{\mathbf{x}}_i = \mathbf{W}\mathbf{W}^T\mathbf{x}_i$$
PCA (another perspective)
Let us assume for simplicity centred data (zero mean). Reconstructed data: $\tilde{\mathbf{X}} = \mathbf{W}\mathbf{W}^T\mathbf{X} = \mathbf{W}\mathbf{Y}$.
$$\mathbf{W}_0 = \arg\min_{\mathbf{W}} \sum_{i=1}^{N}\sum_{j=1}^{F}(x_{ji} - \tilde{x}_{ji})^2 = \arg\min_{\mathbf{W}} \|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 \quad (1)$$
$$\text{s.t.} \quad \mathbf{W}^T\mathbf{W} = \mathbf{I}$$
PCA (another perspective)
$$\begin{aligned}
\|\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X}\|_F^2 &= \operatorname{tr}\!\left[(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})^T(\mathbf{X} - \mathbf{W}\mathbf{W}^T\mathbf{X})\right] \\
&= \operatorname{tr}[\mathbf{X}^T\mathbf{X}] - 2\operatorname{tr}[\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X}] + \operatorname{tr}[\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{W}\mathbf{W}^T\mathbf{X}] \\
&= \underbrace{\operatorname{tr}[\mathbf{X}^T\mathbf{X}]}_{\text{constant}} - \operatorname{tr}[\mathbf{X}^T\mathbf{W}\mathbf{W}^T\mathbf{X}] \qquad (\text{using } \mathbf{W}^T\mathbf{W} = \mathbf{I})
\end{aligned}$$
Hence
$$\min\, (1) \;\Leftrightarrow\; \max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{X}\mathbf{X}^T\mathbf{W}]$$
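The equivalence of the two views can be checked directly: the eigenvector basis yields a smaller Frobenius reconstruction error than other orthonormal bases. A sketch with toy data of my own (a random orthonormal competitor is built via QR):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 200))
X = X - X.mean(axis=1, keepdims=True)      # centred data, as assumed on the slide
St = X @ X.T / X.shape[1]

_, U = np.linalg.eigh(St)
W_pca = U[:, -2:]                          # top-2 principal directions

W_rand, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # random orthonormal basis

err = lambda W: np.linalg.norm(X - W @ W.T @ X) ** 2
print(err(W_pca) <= err(W_rand))           # PCA basis reconstructs best: True
```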
PCA (applications): TEXTURE
[Figure: mean texture and the 1st, 2nd, 3rd, and 4th principal components.]
PCA (applications): SHAPE
[Figure: mean shape and the 1st, 2nd, 3rd, and 4th principal components.]
PCA (applications)
[Figure: identity variation vs. expression variation.]
Linear Discriminant Analysis
- PCA: An unsupervised approach, good for compression of data and data reconstruction. A good statistical prior.
- PCA: Not explicitly defined for classification problems (i.e., in case the data come with labels).
- How do we define a latent space in this case (i.e., one that helps in data classification)?
Linear Discriminant Analysis
- We need to properly define statistical properties which may help us in classification.
- Intuition: We want to find a space in which (a) the data constituting each class look more like each other, while (b) the data of separate classes look more dissimilar.
Linear Discriminant Analysis
How do I make my data in each class look more similar? Minimize the variability in each class (minimize the variance).
[Figure: mapping from the space of $\mathbf{x}$ to the space of $y$, shrinking the per-class variances $\sigma_y^2(c_1)$ and $\sigma_y^2(c_2)$.]
Linear Discriminant Analysis
How do I make the data between classes look dissimilar? I move the data from different classes further away from each other (increase the distance between their means).
[Figure: space of $\mathbf{x}$ mapped to the space of $y$, increasing $(\mu_y(c_1) - \mu_y(c_2))^2$.]
Linear Discriminant Analysis
A bit more formally, I want a latent space $y$ such that:
- $\sigma_y^2(c_1) + \sigma_y^2(c_2)$ is minimum,
- $(\mu_y(c_1) - \mu_y(c_2))^2$ is maximum.
How do I combine them together? Minimize
$$\frac{\sigma_y^2(c_1) + \sigma_y^2(c_2)}{(\mu_y(c_1) - \mu_y(c_2))^2}$$
or maximize
$$\frac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)}$$
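This ratio is the Fisher criterion; a small helper sketch for evaluating it on projected data (the function name and interface are my own, for illustration):

```python
import numpy as np

def fisher_ratio(w, X1, X2):
    """(mu_y(c1) - mu_y(c2))^2 / (sigma_y^2(c1) + sigma_y^2(c2)) for y = w^T x.
    X1 and X2 hold the samples of each class as columns."""
    y1, y2 = w @ X1, w @ X2
    return (y1.mean() - y2.mean()) ** 2 / (y1.var() + y2.var())
```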
Linear Discriminant Analysis
How can I find my latent space? Via a linear projection:
$$y_i = \mathbf{w}^T\mathbf{x}_i$$
Linear Discriminant Analysis
$$\begin{aligned}
\sigma_y^2(c_1) &= \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}(y_i - \mu_y(c_1))^2 = \frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}\left(\mathbf{w}^T(\mathbf{x}_i - \boldsymbol{\mu}(c_1))\right)^2 \\
&= \mathbf{w}^T\left[\frac{1}{N_{c_1}}\sum_{\mathbf{x}_i \in c_1}(\mathbf{x}_i - \boldsymbol{\mu}(c_1))(\mathbf{x}_i - \boldsymbol{\mu}(c_1))^T\right]\mathbf{w} = \mathbf{w}^T\mathbf{S}_1\mathbf{w}
\end{aligned}$$
Similarly, $\sigma_y^2(c_2) = \mathbf{w}^T\mathbf{S}_2\mathbf{w}$.
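A quick check of the identity $\sigma_y^2(c_1) = \mathbf{w}^T\mathbf{S}_1\mathbf{w}$ on toy class data (the values are my own):

```python
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.normal(size=(2, 50))              # toy samples of class c1, as columns
m1 = X1.mean(axis=1, keepdims=True)
S1 = (X1 - m1) @ (X1 - m1).T / X1.shape[1]

w = np.array([0.6, 0.8])
y = w @ X1
print(np.isclose(y.var(), w @ S1 @ w))     # sigma_y^2(c1) == w^T S_1 w: True
```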
Linear Discriminant Analysis
$$\sigma_y^2(c_1) + \sigma_y^2(c_2) = \mathbf{w}^T(\mathbf{S}_1 + \mathbf{S}_2)\mathbf{w} = \mathbf{w}^T\mathbf{S}_w\mathbf{w}, \qquad \mathbf{S}_w: \text{within-class scatter matrix}$$
$$(\mu_y(c_1) - \mu_y(c_2))^2 = \mathbf{w}^T(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_b\mathbf{w}, \qquad \mathbf{S}_b: \text{between-class scatter matrix}$$
$$\frac{(\mu_y(c_1) - \mu_y(c_2))^2}{\sigma_y^2(c_1) + \sigma_y^2(c_2)} = \frac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}$$
Linear Discriminant Analysis
$$\max_{\mathbf{w}} \mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{S}_w\mathbf{w} = 1$$
Lagrangian:
$$L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_b\mathbf{w} - \lambda(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - 1)$$
$$\frac{\partial L(\mathbf{w}, \lambda)}{\partial \mathbf{w}} = 2\mathbf{S}_b\mathbf{w} - 2\lambda\mathbf{S}_w\mathbf{w} = \mathbf{0} \;\Rightarrow\; \mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$$
$\mathbf{w}$ is the eigenvector of $\mathbf{S}_w^{-1}\mathbf{S}_b$ corresponding to the largest eigenvalue:
$$\mathbf{w} \propto \mathbf{S}_w^{-1}(\boldsymbol{\mu}(c_1) - \boldsymbol{\mu}(c_2))$$
Linear Discriminant Analysis
Compute the LDA projection for the following 2D dataset:
$$c_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}$$
$$c_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$$
Linear Discriminant Analysis
Solution (by hand). The class statistics are:
$$\mathbf{S}_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix}, \qquad \mathbf{S}_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$$
$$\boldsymbol{\mu}_1 = [3.0 \;\; 3.6]^T, \qquad \boldsymbol{\mu}_2 = [8.4 \;\; 7.6]^T$$
The between- and within-class scatter matrices are:
$$\mathbf{S}_b = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix}, \qquad \mathbf{S}_w = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}$$
Linear Discriminant Analysis
The LDA projection is then obtained as the solution of the generalized eigenvalue problem:
$$\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w} = \lambda\mathbf{w} \;\Rightarrow\; \left|\mathbf{S}_w^{-1}\mathbf{S}_b - \lambda\mathbf{I}\right| = 0 \;\Rightarrow\; \begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \;\Rightarrow\; \lambda = 15.65$$
$$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = 15.65\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \;\Rightarrow\; \mathbf{w}^* = \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}$$
Or directly by
$$\mathbf{w}^* = \mathbf{S}_w^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \propto \begin{bmatrix} -0.91 \\ -0.39 \end{bmatrix}$$
(the same direction, up to sign).
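The hand computation can be reproduced with numpy (same data as above; the printed direction matches the slide's result up to rounding and sign):

```python
import numpy as np

c1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float).T
c2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float).T

mu1, mu2 = c1.mean(axis=1), c2.mean(axis=1)
S1 = (c1 - mu1[:, None]) @ (c1 - mu1[:, None]).T / c1.shape[1]
S2 = (c2 - mu2[:, None]) @ (c2 - mu2[:, None]).T / c2.shape[1]

Sw = S1 + S2
Sb = np.outer(mu1 - mu2, mu1 - mu2)
print(Sw, Sb, sep="\n")                    # matches the hand-computed matrices

w = np.linalg.solve(Sw, mu1 - mu2)         # S_w^{-1}(mu1 - mu2)
print(w / np.linalg.norm(w))               # ~ [-0.92, -0.39]
```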
LDA (Multiclass & Multidimensional case)
$$\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_d]$$
LDA (Multiclass & Multidimensional case)
Within-class scatter matrix:
$$\mathbf{S}_w = \sum_{c=1}^{C}\mathbf{S}_c, \qquad \mathbf{S}_c = \frac{1}{N_c}\sum_{\mathbf{x}_i \in c}(\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^T$$
Between-class scatter matrix:
$$\mathbf{S}_b = \sum_{c=1}^{C}(\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^T$$
LDA (Multiclass & Multidimensional case)
$$\max_{\mathbf{W}} \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] \quad \text{s.t.} \quad \mathbf{W}^T\mathbf{S}_w\mathbf{W} = \mathbf{I}$$
Lagrangian:
$$L(\mathbf{W}, \boldsymbol{\Lambda}) = \operatorname{tr}[\mathbf{W}^T\mathbf{S}_b\mathbf{W}] - \operatorname{tr}[\boldsymbol{\Lambda}(\mathbf{W}^T\mathbf{S}_w\mathbf{W} - \mathbf{I})]$$
$$\frac{\partial L(\mathbf{W}, \boldsymbol{\Lambda})}{\partial \mathbf{W}} = 2\mathbf{S}_b\mathbf{W} - 2\mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda} = \mathbf{0} \;\Rightarrow\; \mathbf{S}_b\mathbf{W} = \mathbf{S}_w\mathbf{W}\boldsymbol{\Lambda}$$
$\mathbf{W}$ has as columns the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ that correspond to the largest eigenvalues.
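A compact sketch of the multiclass procedure (the function name, interface, and the use of a plain unsymmetric eigensolver are my own choices, not from the slides):

```python
import numpy as np

def lda_projection(X, labels, d):
    """Columns of the returned W are the d leading eigenvectors of S_w^{-1} S_b.
    X: (F, N) data matrix with samples as columns; labels: length-N array."""
    F = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)
    Sw, Sb = np.zeros((F, F)), np.zeros((F, F))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)
        Sw += (Xc - mc) @ (Xc - mc).T / Xc.shape[1]   # S_c, as defined above
        Sb += (mc - mu) @ (mc - mu).T
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(evals.real)[::-1]              # largest eigenvalues first
    return evecs[:, order[:d]].real
```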