Linear Algebra For Machine Learning - University At Buffalo

Transcription

Linear Algebra for Machine Learning
Sargur N. Srihari
srihari@cedar.buffalo.edu

Importance of Linear Algebra in ML

Topics in Linear Algebra for ML
• Why do we need Linear Algebra?
• From scalars to tensors
• Flow of tensors in ML
• Matrix operations: determinant, inverse
• Eigenvalues and eigenvectors
• Singular Value Decomposition
• Principal components analysis

What is linear algebra?
• Linear algebra is the branch of mathematics concerning linear equations such as
    a1x1 + … + anxn = b
  – In vector notation we say aᵀx = b
  – Called a linear transformation of x
• Linear algebra is fundamental to geometry, for defining objects such as lines, planes, rotations
  – The linear equation a1x1 + … + anxn = b defines a plane in (x1,…,xn) space
  – Straight lines define common solutions to equations

Why do we need to know it?
• Linear Algebra is used throughout engineering
  – Because it is based on continuous math rather than discrete math
  – Computer scientists have little experience with it
• Essential for understanding ML algorithms
  – E.g., we convert input vectors (x1,…,xn) into outputs by a series of linear transformations
• Here we discuss:
  – Concepts of linear algebra needed for ML
  – Omit other aspects of linear algebra

Linear Algebra Topics
– Scalars, Vectors, Matrices and Tensors
– Multiplying Matrices and Vectors
– Identity and Inverse Matrices
– Linear Dependence and Span
– Norms
– Special kinds of matrices and vectors
– Eigendecomposition
– Singular value decomposition
– The Moore-Penrose pseudoinverse
– The trace operator
– The determinant
– Ex: principal components analysis

Scalar
• A single number
  – In contrast to other objects in linear algebra, which are usually arrays of numbers
• Represented in lower-case italic x
  – They can be real-valued or be integers
• E.g., let x ∈ ℝ be the slope of the line
  – Defining a real-valued scalar
• E.g., let n ∈ ℕ be the number of units
  – Defining a natural number scalar

Vector
• An array of numbers arranged in order
• Each number is identified by an index
• Written in lower-case bold such as x
  – Its elements are in italic lower case, subscripted:
    x = [x1, x2, …, xn]ᵀ
• If each element is in ℝ then x is in ℝⁿ
• We can think of vectors as points in space
  – Each element gives the coordinate along an axis

Matrices
• A 2-D array of numbers
  – So each element is identified by two indices
• Denoted by bold typeface A
  – Elements indicated by name in italic but not bold
    A1,1 is the top left entry and Am,n is the bottom right entry
  – We can refer to a whole row or column by writing ":" for the coordinate that varies, e.g., for
    A = [A1,1 A1,2; A2,1 A2,2]
    Ai,: is the ith row of A and A:,j is the jth column of A
• If A has a shape of height m and width n with real values, then A ∈ ℝ^(m×n)
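A minimal NumPy sketch of this row/column notation (the array values are made up for illustration):

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [4., 5., 6.]])   # shape (m, n) = (2, 3), so A is in R^(2x3)

    print(A[0, :])    # row A_{1,:}    -> [1. 2. 3.]
    print(A[:, 1])    # column A_{:,2} -> [2. 5.]
    print(A[1, 2])    # element A_{2,3} -> 6.0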

Tensor
• Sometimes we need an array with more than two axes
  – E.g., an RGB color image has three axes
• A tensor is an array of numbers arranged on a regular grid with a variable number of axes
  – See figure next
• Denote a tensor with bold typeface: A
• Element (i,j,k) of a tensor is denoted by Ai,j,k

Dimensions of Tensors
[Figure: from a 0d-tensor (scalar) up to higher-order tensors; fibers of a 3rd-order tensor; slices of a 3rd-order tensor]

Numpy library in Python for tensors
– Zero-dimensional tensor

    import numpy as np
    x = np.array(100)
    print("Array:", x)
    print("Dimension:", x.ndim)

  Output:
    Array: 100
    Dimension: 0

– One-dimensional tensor

    import numpy as np
    x = np.array([1, 5, 2, 7, 11, 24, 25, 12])
    print("Array:", x)
    print("Dimension:", x.ndim)

  Output:
    Array: [ 1  5  2  7 11 24 25 12]
    Dimension: 1

– Two-dimensional tensor

    import numpy as np
    x = np.array([[1, 5, 2, 7, 11, 24, 25, 12],
                  [1, 2, 3, 4, 5, 6, 7, 8]])
    print("Array:", x)
    print("Dimension:", x.ndim)

  Output:
    Array: [[ 1  5  2  7 11 24 25 12]
            [ 1  2  3  4  5  6  7  8]]
    Dimension: 2

Transpose of a Matrix
• An important operation on matrices
• The transpose of a matrix A is denoted AT and defined as
    (Aᵀ)i,j = Aj,i
  – The mirror image across the main diagonal, which runs down and to the right starting from the upper left corner
• Example (3×3):
    A = [A1,1 A1,2 A1,3; A2,1 A2,2 A2,3; A3,1 A3,2 A3,3],  Aᵀ = [A1,1 A2,1 A3,1; A1,2 A2,2 A3,2; A1,3 A2,3 A3,3]
• Example (3×2):
    A = [A1,1 A1,2; A2,1 A2,2; A3,1 A3,2],  Aᵀ = [A1,1 A2,1 A3,1; A1,2 A2,2 A3,2]
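A small NumPy check of the definition (illustrative values, not from the slides):

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4],
                  [5, 6]])          # 3 x 2
    print(A.T)                      # 2 x 3; (A.T)[i, j] == A[j, i]
    print(A.T[0, 2] == A[2, 0])     # True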

Vectors as special case of matrix
• Vectors are matrices with a single column
• Often written in-line using the transpose:
    x = [x1, …, xn]ᵀ
• A scalar is a matrix with one element:
    a = aᵀ

Matrix Addition
• We can add matrices to each other if they have the same shape, by adding corresponding elements
  – If A and B have the same shape (height m, width n):
    C = A + B  where  Ci,j = Ai,j + Bi,j
• A scalar can be added to a matrix, or multiply it:
    D = aB + c  where  Di,j = aBi,j + c
• Less conventional notation used in ML: a vector added to a matrix
    C = A + b  where  Ci,j = Ai,j + bj
  – Called broadcasting since vector b is added to each row of A
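A minimal NumPy illustration of broadcasting (the values are assumed, not from the slides):

    import numpy as np

    A = np.array([[1., 2., 3.],
                  [4., 5., 6.]])
    b = np.array([10., 20., 30.])

    C = A + b          # b is added to each row of A
    print(C)           # [[11. 22. 33.]
                       #  [14. 25. 36.]]
    D = 2 * A + 1      # scalar multiply and add: D_{i,j} = 2*A_{i,j} + 1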

Multiplying Matrices
• For the product C = AB to be defined, A has to have the same number of columns as the number of rows of B
  – If A is of shape m×n and B is of shape n×p, then the matrix product C is of shape m×p
    C = AB  where  Ci,j = Σk Ai,k Bk,j
  – Note that the standard product of two matrices is not just the product of the individual elements
    Such a product does exist and is called the element-wise product or the Hadamard product, A ⊙ B
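A short NumPy sketch contrasting the matrix product with the Hadamard (element-wise) product; the values are illustrative:

    import numpy as np

    A = np.array([[1., 2.],
                  [3., 4.]])
    B = np.array([[5., 6.],
                  [7., 8.]])

    print(A @ B)   # matrix product:   [[19. 22.], [43. 50.]]
    print(A * B)   # Hadamard product: [[ 5. 12.], [21. 32.]]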

Multiplying Vectors
• The dot product between two vectors x and y of the same dimensionality is the matrix product xᵀy
• We can think of the matrix product C = AB as computing Ci,j as the dot product of row i of A and column j of B

Matrix Product Properties
• Distributivity over addition: A(B + C) = AB + AC
• Associativity: A(BC) = (AB)C
• Not commutative: AB = BA is not always true
• Dot product between vectors is commutative: xᵀy = yᵀx
• Transpose of a matrix product has a simple form: (AB)ᵀ = BᵀAᵀ

Example flow of tensors in ML
• A linear classifier: y = Wxᵀ + b
  – Vector x is converted into vector y by multiplying x by a matrix W
• A linear classifier with bias eliminated: y = Wxᵀ

Linear Transformation
• Ax = b
  – where A ∈ ℝ^(n×n) and b ∈ ℝⁿ
  – More explicitly, n equations in n unknowns:
    A1,1 x1 + A1,2 x2 + … + A1,n xn = b1
    A2,1 x1 + A2,2 x2 + … + A2,n xn = b2
    …
    An,1 x1 + An,2 x2 + … + An,n xn = bn
  – A is the n×n matrix [A1,1 … A1,n; … ; An,1 … An,n], x = [x1,…,xn]ᵀ and b = [b1,…,bn]ᵀ are n×1 vectors
• Can view A as a linear transformation of vector x to vector b
• Sometimes we wish to solve for the unknowns x = {x1,…,xn} when A and b provide constraints

Identity and Inverse Matrices
• Matrix inversion is a powerful tool to analytically solve Ax = b
• It needs the concept of the identity matrix
• The identity matrix does not change the value of a vector when we multiply the vector by it
  – Denote the identity matrix that preserves n-dimensional vectors as In
  – Formally, In ∈ ℝ^(n×n) and ∀x ∈ ℝⁿ, In x = x
  – Example of I3:
    [1 0 0; 0 1 0; 0 0 1]

Matrix Inverse
• The inverse of a square matrix A is defined as A⁻¹A = In
• We can now solve Ax = b as follows:
    Ax = b
    A⁻¹Ax = A⁻¹b
    In x = A⁻¹b
    x = A⁻¹b
• This depends on being able to find A⁻¹
• If A⁻¹ exists there are several methods for finding it
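A minimal NumPy sketch of solving Ax = b this way (the values are made up; in practice np.linalg.solve is preferred over explicitly forming the inverse):

    import numpy as np

    A = np.array([[2., 1.],
                  [1., 3.]])
    b = np.array([3., 5.])

    x = np.linalg.inv(A) @ b       # x = A^{-1} b
    print(x)                       # [0.8 1.4]
    print(np.linalg.solve(A, b))   # same solution without forming A^{-1}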

Solving Simultaneous Equations
• Ax = b
  – where A is (M+1) × (M+1)
  – x is (M+1) × 1: the set of weights to be determined
  – b is N × 1

Example: System of Linear Equations in Linear Regression
• Instead of Ax = b we have Φw = t
  – where Φ is the m × n design matrix of m features for n samples xj, j = 1,…,n
  – w is a weight vector of m values
  – t is the vector of target values of the samples, t = [t1,…,tn]
• We need the weights w to be used with the m features to determine the output
    y(x,w) = Σ_{i=1}^{m} wi xi

Closed-form Solutions
• Two closed-form solutions:
  1. Matrix inversion: x = A⁻¹b
  2. Gaussian elimination

Linear Equations: Closed-Form Solutions
1. Matrix formulation: Ax = b, with solution x = A⁻¹b
2. Gaussian elimination followed by back-substitution, using row operations such as
    L2 − 3L1 → L2,  L3 − 2L1 → L3,  −L2/4 → L2
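A sketch of Gaussian elimination with back-substitution in NumPy (a generic textbook implementation with partial pivoting, not the slides' worked example; the test system is assumed):

    import numpy as np

    def gaussian_elimination(A, b):
        """Solve Ax = b by forward elimination with partial pivoting,
        then back-substitution."""
        A = A.astype(float)
        b = b.astype(float)
        n = len(b)
        # Forward elimination
        for k in range(n - 1):
            p = k + np.argmax(np.abs(A[k:, k]))   # pivot row (partial pivoting)
            if p != k:
                A[[k, p]] = A[[p, k]]
                b[[k, p]] = b[[p, k]]
            for i in range(k + 1, n):
                m = A[i, k] / A[k, k]
                A[i, k:] -= m * A[k, k:]
                b[i] -= m * b[k]
        # Back-substitution
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
        return x

    A = np.array([[2., 1., -1.],
                  [-3., -1., 2.],
                  [-2., 1., 2.]])
    b = np.array([8., -11., -3.])
    print(gaussian_elimination(A, b))   # [ 2.  3. -1.]
    print(np.linalg.solve(A, b))        # same answer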

Disadvantages of closed-form solutions
• If A⁻¹ exists, the same A⁻¹ can be used for any given b
  – But A⁻¹ cannot be represented with sufficient precision
  – It is not used in practice
• Gaussian elimination also has disadvantages
  – Numerical instability (division by small numbers)
  – O(n³) cost for an n×n matrix
• Software solutions use the value of b in finding x
  – E.g., the difference (derivative) between b and the output is used iteratively

How many solutions for Ax = b exist?
• A system of equations with n variables and m equations is:
    A1,1 x1 + A1,2 x2 + … + A1,n xn = b1
    A2,1 x1 + A2,2 x2 + … + A2,n xn = b2
    …
    Am,1 x1 + Am,2 x2 + … + Am,n xn = bm
• The solution is x = A⁻¹b
• In order for A⁻¹ to exist, Ax = b must have exactly one solution for every value of b
  – It is also possible for the system of equations to have no solutions or infinitely many solutions for some values of b
    It is not possible to have more than one but fewer than infinitely many solutions
  – If x and y are solutions then z = αx + (1−α)y is a solution for any real α

Span of a set of vectors
• The span of a set of vectors is the set of points obtained by a linear combination of those vectors
  – A linear combination of vectors {v(1),…,v(n)} with coefficients ci is Σi ci v(i)
  – The system of equations is Ax = b
    A column of A, i.e., A:,i, specifies travel in direction i
    How much we need to travel is given by xi
    Ax = Σi xi A:,i
    This is a linear combination of the column vectors
  – Thus determining whether Ax = b has a solution is equivalent to determining whether b is in the span of the columns of A
    This span is referred to as the column space or range of A

Conditions for a solution to Ax = b
• The matrix must be square, i.e., m = n, and all columns must be linearly independent
  – A necessary condition is n ≥ m
    For a solution to exist when A ∈ ℝ^(m×n) we require the column space to be all of ℝᵐ
  – Sufficient condition
    If columns are linear combinations of other columns, the column space is less than ℝᵐ
    The columns are then linearly dependent, or the matrix is singular
    For the column space to encompass ℝᵐ we need at least one set of m linearly independent columns
• For non-square and singular matrices
  – Methods other than matrix inversion are used

Use of a Vector in Regression
• A design matrix
  – N samples, D features
• The feature vector has three dimensions
• This is a regression problem

Norms
• Used for measuring the size of a vector
• Norms map vectors to non-negative values
• The norm of vector x = [x1,…,xn]ᵀ is the distance from the origin to x
• It is any function f that satisfies:
    f(x) = 0 ⟹ x = 0
    f(x + y) ≤ f(x) + f(y)   (Triangle Inequality)
    f(αx) = |α| f(x)   ∀ α ∈ ℝ

Lᵖ Norm
• Definition:
    ||x||p = ( Σi |xi|^p )^(1/p)
• L² norm
  – Called the Euclidean norm
  – Simply the Euclidean distance between the origin and the point x
  – Written simply as ||x||
  – The squared Euclidean norm is the same as xᵀx
  – E.g., for x = (2, 2): ||x||₂ = √(2² + 2²) = √8 = 2√2
• L¹ norm
  – ||x||₁ = Σi |xi|
  – Useful when 0 and non-zero elements have to be distinguished
  – Note that L² increases slowly near the origin, e.g., 0.1² = 0.01
• L∞ norm
  – ||x||∞ = maxi |xi|
  – Called the max norm
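A small NumPy sketch of these norms (the vector is illustrative):

    import numpy as np

    x = np.array([2., -2.])
    print(np.linalg.norm(x, 2))       # L2 norm: 2*sqrt(2) ~ 2.828
    print(x @ x)                      # squared L2 norm, same as x^T x: 8.0
    print(np.linalg.norm(x, 1))       # L1 norm: 4.0
    print(np.linalg.norm(x, np.inf))  # max norm: 2.0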

Use of norm in Regression
• Linear Regression
  – x: a vector, w: weight vector
    y(x,w) = w0 + w1x1 + … + wd xd = wᵀx
  – With nonlinear basis functions ϕj:
    y(x,w) = w0 + Σ_{j=1}^{M−1} wj ϕj(x)
• Loss Function
    E(w) = (1/2) Σ_{n=1}^{N} { y(xn,w) − tn }² + (λ/2) ||w||²
  – The second term is a weighted norm called a regularizer (to prevent overfitting)

Lᵖ Norm and Distance
• The norm is the length of a vector
• We can use it to draw a unit circle about the origin
  – Different p values yield different shapes
    The Euclidean norm yields a circle
• Distance between two vectors v, w:
    dist(v,w) = ||v − w|| = √( (v1 − w1)² + … + (vn − wn)² )
  – The distance to the origin is just the square root of the sum of squares

Size of a Matrix: Frobenius Norm
• Similar to the L² norm:
    ||A||F = ( Σi,j Ai,j² )^(1/2)
• Example:
    A = [2 −1 5; 0 2 1; 3 1 1],  ||A||F = √(4 + 1 + 25 + … + 1) = √46
• Frobenius norm in ML
  – Layers of a neural network involve matrix multiplication
    [Figure: network with weight matrices V and W; hj = f(netj), f(x) = 1/(1 + e^(−x))]
  – Regularization: minimize the Frobenius norm of the weight matrices W(i) over the L layers
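A quick NumPy check of this example:

    import numpy as np

    A = np.array([[2., -1., 5.],
                  [0., 2., 1.],
                  [3., 1., 1.]])
    print(np.linalg.norm(A, 'fro'))   # 6.782... = sqrt(46)
    print(np.sqrt(np.sum(A**2)))      # same value from the definition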

Angle between Vectors
• The dot product of two vectors can be written in terms of their L² norms and the angle θ between them:
    xᵀy = ||x||₂ ||y||₂ cos θ
• The cosine between two vectors is a measure of their similarity
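A small NumPy sketch of cosine similarity (the vectors are made up):

    import numpy as np

    x = np.array([1., 0.])
    y = np.array([1., 1.])
    cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    print(cos_theta)                         # 0.7071... = cos(45 degrees)
    print(np.degrees(np.arccos(cos_theta)))  # 45.0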

Special kind of Matrix: Diagonal
• A diagonal matrix has mostly zeros, with non-zero entries only on the diagonal
  – E.g., the identity matrix, where all diagonal entries are 1
  – E.g., a covariance matrix with independent features
    If Cov(X,Y) = 0 then E(XY) = E(X)E(Y)
    N(x | µ,Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp( −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) )

Efficiency of Diagonal Matrix
• diag(v) denotes a square diagonal matrix with diagonal elements given by the entries of vector v
• Multiplying vector x by a diagonal matrix is efficient
  – To compute diag(v)x we only need to scale each xi by vi:
    diag(v)x = v ⊙ x
• Inverting a square diagonal matrix is efficient
  – The inverse exists iff every diagonal entry is nonzero, in which case diag(v)⁻¹ = diag([1/v1,…,1/vn]ᵀ)
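A brief NumPy illustration (the values are assumed):

    import numpy as np

    v = np.array([2., 4., 5.])
    x = np.array([1., 1., 2.])
    print(np.diag(v) @ x)    # [ 2.  4. 10.]
    print(v * x)             # same result via element-wise scaling
    print(np.diag(1.0 / v))  # inverse of diag(v): reciprocals on the diagonal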

Special kind of Matrix: Symmetric
• A symmetric matrix equals its transpose: A = Aᵀ
  – E.g., a distance matrix is symmetric, with Ai,j = Aj,i
  – E.g., covariance matrices are symmetric

Special Kinds of Vectors
• Unit vector
  – A vector with unit norm: ||x||₂ = 1
• Orthogonal vectors
  – A vector x and a vector y are orthogonal to each other if xᵀy = 0
    If the vectors have nonzero norm, they are at 90 degrees to each other
  – Orthonormal vectors
    Vectors that are orthogonal and have unit norm
• Orthogonal matrix
  – A square matrix whose rows are mutually orthonormal: AᵀA = AAᵀ = I
  – A⁻¹ = Aᵀ
    Orthogonal matrices are of interest because their inverse is very cheap to compute

Matrix decomposition
• Matrices can be decomposed into factors to learn universal properties, just like integers:
  – Properties not discernible from their representation
  1. Decomposition of an integer into prime factors
     From 12 = 2 × 2 × 3 we can discern that
     – 12 is not divisible by 5, or
     – any multiple of 12 is divisible by 3
     – But the representations of 12 in binary or decimal are different
  2. Decomposition of matrix A as A = V diag(λ) V⁻¹
     where V is formed of eigenvectors and λ are the eigenvalues, e.g.,
     A = [2 1; 1 2] has eigenvalues λ = 1 and λ = 3 and eigenvectors
     v_{λ=1} = [1; −1], v_{λ=3} = [1; 1]

Eigenvector
• An eigenvector of a square matrix A is a non-zero vector v such that multiplication by A only changes the scale of v:
    Av = λv
  – The scalar λ is known as the eigenvalue
• If v is an eigenvector of A, so is any rescaled vector sv; moreover sv still has the same eigenvalue
  – Thus we look for a unit eigenvector
[Figure source: Wikipedia]

Eigenvalue and Characteristic Polynomial
• Consider Av = w, where A is an n×n matrix and v, w are n×1 vectors
• If v and w are scalar multiples, i.e., if Av = λv, then v is an eigenvector of the linear transformation A and the scale factor λ is the eigenvalue corresponding to the eigenvector
• This is the eigenvalue equation of matrix A
  – Stated equivalently as (A − λI)v = 0
  – This has a non-zero solution iff |A − λI| = 0
    The characteristic polynomial of degree n can be factored as
    |A − λI| = (λ1 − λ)(λ2 − λ)…(λn − λ)
    The λ1, λ2, …, λn are the roots of the polynomial and are the eigenvalues of A

Example of Eigenvalue/Eigenvector
• Consider the matrix A = [2 1; 1 2]
• Taking the determinant of (A − λI), the characteristic polynomial is
    |A − λI| = |2−λ  1; 1  2−λ| = 3 − 4λ + λ²
• It has roots λ = 1 and λ = 3, which are the two eigenvalues of A
• The eigenvectors are found by solving for v in Av = λv, which gives
    v_{λ=1} = [1; −1],  v_{λ=3} = [1; 1]

Eigendecomposition
• Suppose that matrix A has n linearly independent eigenvectors {v(1),…,v(n)} with eigenvalues {λ1,…,λn}
• Concatenate the eigenvectors to form matrix V
• Concatenate the eigenvalues to form vector λ = [λ1,…,λn]
• The eigendecomposition of A is then given by
    A = V diag(λ) V⁻¹
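A small NumPy check of the eigendecomposition, reusing the 2×2 example above:

    import numpy as np

    A = np.array([[2., 1.],
                  [1., 2.]])
    lam, V = np.linalg.eig(A)        # eigenvalues and eigenvectors (columns of V)
    print(lam)                       # [3. 1.]  (order may vary)
    print(V @ np.diag(lam) @ np.linalg.inv(V))   # reconstructs A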

Decomposition of Symmetric Matrix
• Every real symmetric matrix A can be decomposed into real-valued eigenvectors and eigenvalues:
    A = QΛQᵀ
  – where Q is an orthogonal matrix composed of the eigenvectors of A: {v(1),…,v(n)}
    (orthogonal matrix: components are orthogonal, i.e., v(i)ᵀv(j) = 0)
  – Λ is a diagonal matrix of eigenvalues {λ1,…,λn}
• We can think of A as scaling space by λi in direction v(i)
  – See figure on next slide

Effect of Eigenvectors and Eigenvalues
• Example of a 2×2 matrix
• Matrix A with two orthonormal eigenvectors
  – v(1) with eigenvalue λ1, v(2) with eigenvalue λ2
[Figure: plot of unit vectors u ∈ ℝ² (a circle) and plot of the vectors Au (an ellipse), in variables x1 and x2]

Python Code
• Video: youtube.com/watch?v=mxkGMbrobY0

Eigendecomposition is not unique
• The eigendecomposition is A = QΛQᵀ
  – where Q is an orthogonal matrix composed of the eigenvectors of A
• The decomposition is not unique when two eigenvalues are the same
• By convention, order the entries of Λ in descending order:
  – Under this convention, the eigendecomposition is unique if all eigenvalues are unique

What does eigendecomposition tell us?
• It tells us useful facts about the matrix:
  1. The matrix is singular if and only if any eigenvalue is zero
  2. It is useful to optimize quadratic expressions of the form
     f(x) = xᵀAx subject to ||x||₂ = 1
     Whenever x is equal to an eigenvector, f is equal to the corresponding eigenvalue
     The maximum value of f is the maximum eigenvalue, the minimum value is the minimum eigenvalue
• An example of such a quadratic form appears in the multivariate Gaussian:
    N(x | µ,Σ) = (1 / ((2π)^(D/2) |Σ|^(1/2))) exp( −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) )

Positive Definite Matrix
• A matrix whose eigenvalues are all positive is called positive definite
  – Positive or zero is called positive semidefinite
• If the eigenvalues are all negative it is negative definite
• Positive definite matrices guarantee that xᵀAx > 0

Singular Value Decomposition (SVD)
• Eigendecomposition has the form: A = V diag(λ) V⁻¹
  – If A is not square, the eigendecomposition is undefined
• SVD is a decomposition of the form A = UDVᵀ
• SVD is more general than eigendecomposition
  – Used with any matrix rather than only symmetric ones
  – Every real matrix has an SVD
    The same is not true of the eigendecomposition

SVD Definition
• Write A as a product of 3 matrices: A = UDVᵀ
  – If A is m×n, then U is m×m, D is m×n, V is n×n
• Each of these matrices has a special structure
  – U and V are orthogonal matrices
  – D is a diagonal matrix, not necessarily square
    Elements of the diagonal of D are called the singular values of A
    Columns of U are called the left-singular vectors
    Columns of V are called the right-singular vectors
• SVD interpreted in terms of eigendecomposition
  – Left-singular vectors of A are eigenvectors of AAᵀ
  – Right-singular vectors of A are eigenvectors of AᵀA
  – Nonzero singular values of A are square roots of the eigenvalues of AᵀA; the same is true of AAᵀ
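A minimal NumPy sketch of the SVD of a non-square matrix (the matrix is made up for illustration):

    import numpy as np

    A = np.array([[1., 0., 1.],
                  [0., 1., 1.]])                  # 2 x 3
    U, s, Vt = np.linalg.svd(A)                   # s holds the singular values
    D = np.zeros_like(A)
    D[:len(s), :len(s)] = np.diag(s)              # embed s into an m x n diagonal D
    print(np.allclose(A, U @ D @ Vt))             # True: A = U D V^T
    print(s**2)                                   # eigenvalues of A A^T (here 3 and 1)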

Use of SVD in ML
1. SVD is used in generalizing matrix inversion
   – Moore-Penrose inverse (discussed next)
2. Used in recommendation systems
   – Collaborative filtering (CF)
     A method to predict a rating for a user-item pair based on the history of ratings given by the user and given to the item
     Most CF algorithms are based on a user-item rating matrix where each row represents a user and each column an item
     – Entries of this matrix are ratings given by users to items
     SVD reduces the number of features of a data set by reducing the space dimensions from N to K, where K < N

SVD in Collaborative Filtering
• X is the utility matrix
  – Xij denotes how user i likes item j
  – CF fills in the blank cells of the utility matrix that have no entry
• Scalability and sparsity are handled using SVD
  – SVD decreases the dimension of the utility matrix by extracting its latent factors
    Map each user and item into a latent space of dimension r

Moore-Penrose Pseudoinverse
• The most useful feature of SVD is that it can be used to generalize matrix inversion to non-square matrices
• Practical algorithms for computing the pseudoinverse of A are based on SVD:
    A⁺ = V D⁺ Uᵀ
  – where U, D, V are the SVD of A
  – The pseudoinverse D⁺ of D is obtained by taking the reciprocal of its nonzero elements and then taking the transpose of the resulting matrix
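A quick NumPy comparison of np.linalg.pinv with the SVD construction (the matrix is the same illustrative one as above):

    import numpy as np

    A = np.array([[1., 0., 1.],
                  [0., 1., 1.]])
    U, s, Vt = np.linalg.svd(A)
    D_plus = np.zeros(A.shape).T                 # n x m
    D_plus[:len(s), :len(s)] = np.diag(1.0 / s)  # reciprocals of nonzero singular values
    A_plus = Vt.T @ D_plus @ U.T                 # A+ = V D+ U^T
    print(np.allclose(A_plus, np.linalg.pinv(A)))   # True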

Trace of a Matrix
• The trace operator gives the sum of the elements along the diagonal:
    Tr(A) = Σi Ai,i
• The Frobenius norm of a matrix can be represented as
    ||A||F = ( Tr(AAᵀ) )^(1/2)

Determinant of a Matrix
• The determinant of a square matrix, det(A), is a mapping to a scalar
• It is equal to the product of all eigenvalues of the matrix
• It measures how much multiplication by the matrix expands or contracts space
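A one-line NumPy check that the determinant equals the product of the eigenvalues, using the earlier 2×2 example:

    import numpy as np

    A = np.array([[2., 1.],
                  [1., 2.]])
    print(np.linalg.det(A))                  # 3.0
    print(np.prod(np.linalg.eigvals(A)))     # 3.0 = 1 * 3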

Example: PCA
• A simple ML algorithm is Principal Components Analysis
• It can be derived using only knowledge of basic linear algebra

PCA Problem Statement
• Given a collection of m points {x(1),…,x(m)} in ℝⁿ, represent them in a lower dimension
  – For each point x(i) find a code vector c(i) in ℝˡ
  – If l is smaller than n it will take less memory to store the points
  – This is lossy compression
  – Find an encoding function f(x) = c and a decoding function x ≈ g(f(x))

PCA using Matrix Multiplication
• One choice of decoding function is to use matrix multiplication: g(c) = Dc, where D ∈ ℝ^(n×l)
  – D is a matrix with l columns
• To keep the encoding easy, we require the columns of D to be orthogonal to each other
• To constrain solutions, we require the columns of D to have unit norm
• We need to find the optimal code c* given D
• Then we need the optimal D

Finding optimal code given D
• To generate the optimal code point c* given input x, minimize the distance between the input point x and its reconstruction g(c*):
    c* = argmin_c ||x − g(c)||₂
  – Using the squared L² norm instead of the L² norm, the function being minimized is equivalent to
    (x − g(c))ᵀ (x − g(c))
• Using g(c) = Dc, the optimal code can be shown to be equivalent to
    c* = argmin_c −2xᵀDc + cᵀc

Optimal Encoding for PCA
• Using vector calculus:
    ∇c(−2xᵀDc + cᵀc) = 0
    −2Dᵀx + 2c = 0
    c = Dᵀx
• Thus we can encode x using a matrix-vector operation
  – To encode we use f(x) = Dᵀx
  – For PCA reconstruction, since g(c) = Dc, we use r(x) = g(f(x)) = DDᵀx
  – Next we need to choose the encoding matrix D

Method for finding optimal D
• Revisit the idea of minimizing the L² distance between inputs and reconstructions
  – But we cannot consider the points in isolation
  – So minimize the error over all points, using the Frobenius norm:
    D* = argmin_D √( Σi,j ( x(i)j − r(x(i))j )² )  subject to DᵀD = Il
• Use the design matrix X ∈ ℝ^(m×n)
  – Given by stacking all the vectors describing the points
• To derive an algorithm for finding D*, start by considering the case l = 1
  – In this case D is just a single vector d

Final Solution to PCA
• For l = 1, the optimization problem is solved using eigendecomposition
  – Specifically, the optimal d is given by the eigenvector of XᵀX corresponding to the largest eigenvalue
• More generally, the matrix D is given by the l eigenvectors of XᵀX corresponding to the largest eigenvalues (proof by induction)
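A compact NumPy sketch of this recipe (the data is randomly generated for illustration; centering the data first is a common convention that the slides do not spell out):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))        # m = 100 points in R^3
    X = X - X.mean(axis=0)               # center the data (common convention)

    # Eigendecomposition of X^T X; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    l = 2
    D = eigvecs[:, ::-1][:, :l]          # l eigenvectors with the largest eigenvalues

    codes = X @ D                        # encode: f(x) = D^T x for each row of X
    X_rec = codes @ D.T                  # reconstruct: r(x) = D D^T x
    print(np.linalg.norm(X - X_rec))     # reconstruction error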
