Matrix Calculus: Derivation And Simple Application

Transcription

Matrix Calculus: Derivation and Simple Application

HU, Pili†
March 30, 2012 (last compiled April 24, 2012)

Abstract

Matrix calculus [3] is a very useful tool in many engineering problems. Basic rules of matrix calculus are nothing more than the ordinary calculus rules covered in undergraduate courses. However, using matrix calculus, the derivation process is more compact. This document is adapted from the notes of a course the author recently attended. It builds matrix calculus from scratch; the only prerequisites are basic calculus notions and linear algebra operations. To get a quick executive guide, please refer to the cheat sheet in Section 4. To see how matrix calculus simplifies the process of derivation, please refer to the application in Section 3.4.

† hupili [at] ie [dot] cuhk [dot] edu [dot] hk

Contents

1 Introductory Example
2 Derivation
  2.1 Organization of Elements
  2.2 Deal with Inner Product
  2.3 Properties of Trace
  2.4 Deal with Generalized Inner Product
  2.5 Define Matrix Differential
  2.6 Matrix Differential Properties
  2.7 Schema of Handling Scalar Function
  2.8 Determinant
  2.9 Vector Function and Vector Variable
  2.10 Vector Function Differential
  2.11 Chain Rule
3 Application
  3.1 The 2nd Induced Norm of Matrix
  3.2 General Multivariate Gaussian Distribution
  3.3 Maximum Likelihood Estimation of Gaussian
  3.4 Least Square Error Inference: a Comparison
4 Cheat Sheet
  4.1 Definition
  4.2 Schema for Scalar Function
  4.3 Schema for Vector Function
  4.4 Properties
  4.5 Frequently Used Formula
  4.6 Chain Rule
Acknowledgements
References
Appendix

1 Introductory Example

We start with a one-variable linear function:
\[ f(x) = ax \tag{1} \]
To be coherent, we abuse the partial derivative notation:
\[ \frac{\partial f}{\partial x} = a \tag{2} \]
Extending this function to be multivariate, we have:
\[ f(x) = \sum_i a_i x_i = a^T x \tag{3} \]
where a = [a_1, a_2, ..., a_n]^T and x = [x_1, x_2, ..., x_n]^T. We first compute the partial derivatives directly:
\[ \frac{\partial f}{\partial x_k} = \frac{\partial (\sum_i a_i x_i)}{\partial x_k} = a_k \tag{4} \]
for all k = 1, 2, ..., n. Then we organize the n partial derivatives in the following way:
\[ \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = a \tag{5} \]
The first equality is by proper definition, and the rest follows from ordinary calculus rules.

Eqn (5) is analogous to eqn (2), except that the variable changes from a scalar to a vector. Thus we want to claim the result of eqn (5) directly, without the intermediate steps of solving for the partial derivatives separately. Actually, we'll soon see that eqn (5) plays a core role in matrix calculus.

The following sections are organized as follows:

- Section 2 builds commonly used matrix calculus rules from ordinary calculus and linear algebra. Necessary and important properties of linear algebra are also proved along the way. This section is not organized beforehand; all results are proved when we need them.
- Section 3 shows some applications using matrix calculus. Table 1 shows the relation between Section 2 and Section 3.
- Section 4 concludes with a cheat sheet of matrix calculus. Note that this cheat sheet may be different from others; users need to figure out some basic definitions before applying the rules.
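Results like eqn (5) can always be sanity-checked numerically. Below is a minimal sketch (assuming numpy is available; the variable names are ours, not from the original notes) that compares a finite-difference gradient of f(x) = a^T x against a:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda x: a @ x      # f(x) = a^T x
eps = 1e-6

# central finite differences for each partial derivative, organized as a column like eqn (5)
grad = np.array([(f(x + eps * np.eye(n)[k]) - f(x - eps * np.eye(n)[k])) / (2 * eps)
                 for k in range(n)])
print(np.allclose(grad, a))   # True: the organized gradient equals a
```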

Table 1: Derivation and Application Correspondence

    Derivation    Application
    2.1-2.7       3.1
    2.9, 2.10     3.2
    2.8, 2.11     3.3

2 Derivation

2.1 Organization of Elements

From the introductory example, we already see that matrix calculus does not differ from ordinary calculus in its fundamental rules. However, with a better organization of elements and by proving useful properties, we can simplify the derivation process in real problems.

The author would like to adopt the following definition:

Definition 1. For a scalar-valued function f(x), the result \partial f / \partial x has the same size as x. That is,
\[ \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{12}} & \cdots & \frac{\partial f}{\partial x_{1n}} \\ \frac{\partial f}{\partial x_{21}} & \frac{\partial f}{\partial x_{22}} & \cdots & \frac{\partial f}{\partial x_{2n}} \\ \vdots & & & \vdots \\ \frac{\partial f}{\partial x_{m1}} & \frac{\partial f}{\partial x_{m2}} & \cdots & \frac{\partial f}{\partial x_{mn}} \end{bmatrix} \tag{6} \]

In eqn (2), x is a 1-by-1 matrix and the result \partial f / \partial x = a is also a 1-by-1 matrix. In eqn (5), x is a column vector (i.e. an n-by-1 matrix) and the result \partial f / \partial x = a has the same size.

Example 1. By this definition, we have:
\[ \frac{\partial f}{\partial x^T} = \left( \frac{\partial f}{\partial x} \right)^T = a^T \tag{7} \]
Note that we only use the organization definition in this example. Later we'll show that, with some matrix properties, this formula can be derived without using \partial f / \partial x as a bridge.

2.2 Deal with Inner Product

Theorem 1. If there is a multivariate scalar function f(x) = a^T x, we have \partial f / \partial x = a.

Proof. See the introductory example.

Since a^T x is a scalar, we can write it equivalently as its own trace. Thus:

Proposition 2. If there is a multivariate scalar function f(x) = Tr[a^T x], we have \partial f / \partial x = a.

Tr[.] is the operator that sums up the diagonal elements of a matrix. In the next section, we'll explore more properties of the trace. As long as we can transform our target function into the form of Theorem 1 or Proposition 2, the result can be written out directly. Notice that in Proposition 2, a and x are both vectors. We'll show later that, as long as their sizes agree, it also holds for matrix a and x.

2.3 Properties of Trace

Definition 2. The trace of a square matrix is defined as Tr[A] = \sum_i A_{ii}.

Example 2. Using Definitions 1 and 2, it is very easy to show:
\[ \frac{\partial \mathrm{Tr}[A]}{\partial A} = I \tag{8} \]
since only the diagonal elements are kept by the trace operator.

Theorem 3. The matrix trace has the following properties:

(1) Tr[A + B] = Tr[A] + Tr[B]
(2) Tr[cA] = c Tr[A]
(3) Tr[AB] = Tr[BA]
(4) Tr[A_1 A_2 ... A_n] = Tr[A_n A_1 ... A_{n-1}]
(5) Tr[A^T B] = \sum_i \sum_j A_{ij} B_{ij}
(6) Tr[A] = Tr[A^T]

where A, B are matrices of proper sizes, and c is a scalar value.

Proof. See Wikipedia [5] for the proof.

Here we explain the intuition behind each property to make it easier to remember. Properties (1) and (2) show the linearity of the trace. Property (3) means that the multiplication of two matrices inside the trace operator is commutative. Note that matrix multiplication without the trace is not commutative, and the commutative property inside the trace does not hold for more than two matrices.

Property (4) follows from property (3) by considering A_1 A_2 ... A_{n-1} as a whole. It is known as the cyclic property, so you can rotate the matrices inside a trace operator. Property (5) gives a way to express the sum of element-by-element products using a matrix product and the trace. Note that the inner product of two vectors is also a sum of element-by-element products, and property (5) resembles the vector inner product in form (A^T B). The author regards property (5) as the extension of the inner product to matrices (a generalized inner product).

2.4 Deal with Generalized Inner Product

Theorem 4. If there is a multivariate scalar function f(x) = Tr[A^T x], we have \partial f / \partial x = A. (A, x can be matrices.)

Proof. Using property (5) of the trace, we can write f as:
\[ f(x) = \mathrm{Tr}[A^T x] = \sum_{ij} A_{ij} x_{ij} \tag{9} \]
It is easy to show:
\[ \frac{\partial f}{\partial x_{ij}} = \frac{\partial (\sum_{ij} A_{ij} x_{ij})}{\partial x_{ij}} = A_{ij} \tag{10} \]
Organizing the elements using Definition 1 proves the theorem.

With this theorem and the properties of the trace, we revisit Example 1.

Example 3. For vectors a, x and the function f(x) = a^T x (f is scalar):
\[ \frac{\partial f}{\partial x^T} = \frac{\partial (a^T x)}{\partial x^T} \tag{11} \]
\[ = \frac{\partial \mathrm{Tr}[a^T x]}{\partial x^T} \quad (f \text{ is scalar}) \tag{12} \]
\[ = \frac{\partial \mathrm{Tr}[x a^T]}{\partial x^T} \quad (\text{property (3)}) \tag{13} \]
\[ = \frac{\partial \mathrm{Tr}[(x a^T)^T]}{\partial x^T} \quad (\text{property (6)}) \tag{14} \]
\[ = \frac{\partial \mathrm{Tr}[a x^T]}{\partial x^T} \quad (\text{property of transpose}) \tag{15} \]
\[ = \frac{\partial \mathrm{Tr}[(a^T)^T x^T]}{\partial x^T} \tag{16} \]
\[ = a^T \quad (\text{Theorem 4}) \tag{17} \]
The result is the same as in Example 1, where we used the basic definition. The above example demonstrates the usual way of handling a matrix derivative problem.
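These trace properties are easy to confirm numerically. The following sketch (assuming numpy; the variable names are ours) checks properties (3)-(6) on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((4, 4)) for _ in range(3))

print(np.isclose(np.trace(A @ B), np.trace(B @ A)))          # property (3)
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))  # property (4), cyclic rotation
print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))          # property (5), generalized inner product
print(np.isclose(np.trace(A), np.trace(A.T)))                # property (6)
```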

2.5 Define Matrix Differential

Although we want the matrix derivative most of the time, it turns out that the matrix differential is easier to operate with, due to the form invariance property of the differential. The matrix differential inherits this property as a natural consequence of the following definition.

Definition 3. Define the matrix differential:
\[ \mathrm{d}A = \begin{bmatrix} \mathrm{d}A_{11} & \mathrm{d}A_{12} & \cdots & \mathrm{d}A_{1n} \\ \mathrm{d}A_{21} & \mathrm{d}A_{22} & \cdots & \mathrm{d}A_{2n} \\ \vdots & & & \vdots \\ \mathrm{d}A_{m1} & \mathrm{d}A_{m2} & \cdots & \mathrm{d}A_{mn} \end{bmatrix} \tag{18} \]

Theorem 5. The differential operator is distributive through the trace operator: d Tr[A] = Tr[dA].

Proof.
\[ \mathrm{LHS} = \mathrm{d}\Big( \sum_i A_{ii} \Big) = \sum_i \mathrm{d}A_{ii} \tag{19} \]
\[ \mathrm{RHS} = \mathrm{Tr}\begin{bmatrix} \mathrm{d}A_{11} & \cdots & \mathrm{d}A_{1n} \\ \vdots & & \vdots \\ \mathrm{d}A_{m1} & \cdots & \mathrm{d}A_{mn} \end{bmatrix} \tag{20} \]
\[ = \sum_i \mathrm{d}A_{ii} = \mathrm{LHS} \tag{21} \]

Now that the matrix differential is well defined, we want to relate it back to the matrix derivative. The scalar versions of the differential and derivative are related as follows:
\[ \mathrm{d}f = \frac{\partial f}{\partial x} \mathrm{d}x \tag{22} \]
So far we are dealing with a scalar function f and a matrix variable x; \partial f / \partial x and dx are both matrices according to the definitions. In order to make the quantities in eqn (22) equal, we must figure out a way to make the RHS a scalar. It is not surprising that the trace is what we want.

Theorem 6.
\[ \mathrm{d}f = \mathrm{Tr}\left[ \left( \frac{\partial f}{\partial x} \right)^T \mathrm{d}x \right] \tag{23} \]
for a scalar function f and an arbitrarily sized x.

Proof.
\[ \mathrm{LHS} = \mathrm{d}f = \sum_{ij} \frac{\partial f}{\partial x_{ij}} \mathrm{d}x_{ij} \quad (\text{definition of scalar differential}) \tag{24-25} \]
\[ \mathrm{RHS} = \mathrm{Tr}\left[ \left( \frac{\partial f}{\partial x} \right)^T \mathrm{d}x \right] = \sum_{ij} \left( \frac{\partial f}{\partial x} \right)_{ij} (\mathrm{d}x)_{ij} \quad (\text{trace property (5)}) \tag{26-27} \]
\[ = \sum_{ij} \left( \frac{\partial f}{\partial x} \right)_{ij} \mathrm{d}x_{ij} \quad (\text{Definition 3}) \tag{28} \]
\[ = \sum_{ij} \frac{\partial f}{\partial x_{ij}} \mathrm{d}x_{ij} \quad (\text{Definition 1}) \tag{29} \]
\[ = \mathrm{LHS} \tag{30} \]

Theorem 6 is the bridge between the matrix derivative and the matrix differential. We'll see in later applications that the matrix differential is more convenient to manipulate. After certain manipulations we can reach the form of Theorem 6, and then we can directly write out the matrix derivative using this theorem.

2.6 Matrix Differential Properties

Theorem 7. We claim the following properties of the matrix differential:

- d(cA) = c dA
- d(A + B) = dA + dB
- d(AB) = dA B + A dB

Proof. They are all natural consequences of Definition 3. We only show the third one in this document. Note that the equivalence holds if the LHS and RHS are equal element by element. We consider the (ij)-th element:
\[ \mathrm{LHS}_{ij} = \mathrm{d}\Big( \sum_k A_{ik} B_{kj} \Big) = \sum_k (\mathrm{d}A_{ik}\, B_{kj} + A_{ik}\, \mathrm{d}B_{kj}) \tag{31-32} \]
\[ \mathrm{RHS}_{ij} = (\mathrm{d}A\, B)_{ij} + (A\, \mathrm{d}B)_{ij} = \sum_k \mathrm{d}A_{ik}\, B_{kj} + \sum_k A_{ik}\, \mathrm{d}B_{kj} = \mathrm{LHS}_{ij} \tag{33-35} \]
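As a quick illustration of Theorem 6 (together with Theorem 4), here is a minimal numerical sketch, assuming numpy: for f(x) = Tr[A^T x] we have \partial f / \partial x = A, so the change of f under a small perturbation dx should equal Tr[A^T dx]:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((3, 4))
dX = 1e-6 * rng.standard_normal((3, 4))   # a small perturbation playing the role of dx

f = lambda X: np.trace(A.T @ X)           # f(x) = Tr[A^T x], so df/dx = A by Theorem 4
df_actual = f(X + dX) - f(X)
df_theorem6 = np.trace(A.T @ dX)          # Tr[(df/dx)^T dx], eqn (23)
print(np.isclose(df_actual, df_theorem6)) # True (exact up to floating-point error, since f is linear)
```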

Example 4. Given the function f(x) = x^T A x, where A is square and x is a column vector, we can compute:
\[ \mathrm{d}f = \mathrm{d}\,\mathrm{Tr}[x^T A x] \tag{36} \]
\[ = \mathrm{Tr}[\mathrm{d}(x^T A x)] \tag{37} \]
\[ = \mathrm{Tr}[\mathrm{d}(x^T) A x + x^T \mathrm{d}(A x)] \tag{38} \]
\[ = \mathrm{Tr}[\mathrm{d}(x^T) A x + x^T \mathrm{d}A\, x + x^T A\, \mathrm{d}x] \tag{39} \]
\[ = \mathrm{Tr}[\mathrm{d}x^T A x + x^T A\, \mathrm{d}x] \quad (A \text{ is constant}) \tag{40} \]
\[ = \mathrm{Tr}[\mathrm{d}x^T A x] + \mathrm{Tr}[x^T A\, \mathrm{d}x] \tag{41} \]
\[ = \mathrm{Tr}[x^T A^T \mathrm{d}x] + \mathrm{Tr}[x^T A\, \mathrm{d}x] \tag{42} \]
\[ = \mathrm{Tr}[x^T A^T \mathrm{d}x + x^T A\, \mathrm{d}x] \tag{43} \]
\[ = \mathrm{Tr}[(x^T A^T + x^T A)\, \mathrm{d}x] \tag{44} \]
Using Theorem 6, we obtain the derivative:
\[ \frac{\partial f}{\partial x} = (x^T A^T + x^T A)^T = A x + A^T x \tag{45} \]
When A is symmetric, it simplifies to:
\[ \frac{\partial f}{\partial x} = 2 A x \tag{46} \]
Letting A = I, we have:
\[ \frac{\partial (x^T x)}{\partial x} = 2x \tag{47} \]

Example 5. For a non-singular square matrix X, we have X X^{-1} = I. Take matrix differentials on both sides:
\[ 0 = \mathrm{d}I = \mathrm{d}(X X^{-1}) = \mathrm{d}X\, X^{-1} + X\, \mathrm{d}(X^{-1}) \tag{48} \]
Rearranging terms:
\[ \mathrm{d}(X^{-1}) = -X^{-1}\, \mathrm{d}X\, X^{-1} \tag{49} \]

2.7 Schema of Handling Scalar Function

The above examples already demonstrate the general schema. Here we summarize the process:

1. df = d Tr[f] = Tr[df]
2. Apply the trace properties (see Theorem 3) and the matrix differential properties (see Theorem 7) to get the following form:
\[ \mathrm{d}f = \mathrm{Tr}[A^T \mathrm{d}x] \tag{50} \]

3. Apply Theorem 6 to get:
\[ \frac{\partial f}{\partial x} = A \tag{51} \]

At this point, you can already handle many problems. In this schema, the matrix differential and the trace play crucial roles. Later we'll deduce some widely used formulas to facilitate potential applications. As you will see, although we rely on the matrix differential in the schema, the deduction of certain formulas may be more easily done using matrix derivatives.

2.8 Determinant

For background on the determinant, please refer to [6]. We first quote some definitions and properties without proof:

Theorem 8. Let A be a square matrix:

- The minor M_{ij} is obtained by removing the i-th row and j-th column of A and then taking the determinant of the resulting (n-1)-by-(n-1) matrix.
- The ij-th cofactor is defined as C_{ij} = (-1)^{i+j} M_{ij}.
- If we expand the determinant with respect to the i-th row, det(A) = \sum_j A_{ij} C_{ij}.
- The adjugate of A is defined as adj(A)_{ij} = (-1)^{i+j} M_{ji} = C_{ji}, so adj(A) = C^T.
- For a non-singular matrix A, we have: A^{-1} = adj(A)/det(A) = C^T/det(A).

Now we are ready to show the derivative of the determinant. Note that the determinant is just a scalar function, so all the techniques discussed above are applicable. We first write the derivative element by element. Expanding the determinant on the i-th row, we have:
\[ \frac{\partial \det(A)}{\partial A_{ij}} = \frac{\partial (\sum_j A_{ij} C_{ij})}{\partial A_{ij}} = C_{ij} \tag{52} \]
The first equality is from the determinant definition, and the second follows from the observation that only the coefficient of A_{ij} is left. Grouping all elements using Definition 1, we have:
\[ \frac{\partial \det(A)}{\partial A} = C = \mathrm{adj}(A)^T \tag{53} \]
If A is non-singular, we have:
\[ \frac{\partial \det(A)}{\partial A} = (\det(A) A^{-1})^T = \det(A) (A^{-1})^T \tag{54} \]
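Both Example 4 and eqn (54) can be verified by finite differences. A minimal sketch, assuming numpy (the step size and tolerances are ad hoc choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6

# Example 4: d(x^T A x)/dx = A x + A^T x
f = lambda x: x @ A @ x
g_num = np.array([(f(x + eps * np.eye(n)[k]) - f(x - eps * np.eye(n)[k])) / (2 * eps)
                  for k in range(n)])
print(np.allclose(g_num, A @ x + A.T @ x))

# Eqn (54): d det(A)/dA = det(A) (A^{-1})^T, checked entry by entry
G_num = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        G_num[i, j] = (np.linalg.det(A + E) - np.linalg.det(A - E)) / (2 * eps)
print(np.allclose(G_num, np.linalg.det(A) * np.linalg.inv(A).T, atol=1e-4))
```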

Next, we use Theorem 6 to give the differential relationship:
\[ \mathrm{d}\det(A) = \mathrm{Tr}\left[ \left( \frac{\partial \det(A)}{\partial A} \right)^T \mathrm{d}A \right] \tag{55} \]
\[ = \mathrm{Tr}\left[ (\det(A)(A^{-1})^T)^T \mathrm{d}A \right] \tag{56} \]
\[ = \mathrm{Tr}\left[ \det(A)\, A^{-1} \mathrm{d}A \right] \tag{57} \]
In many practical problems, the log determinant is more widely used:
\[ \frac{\partial \ln \det(A)}{\partial A} = \frac{1}{\det(A)} \frac{\partial \det(A)}{\partial A} = (A^{-1})^T \tag{58} \]
The first equality comes from the chain rule of ordinary calculus (ln det(A) and det(A) are both scalars). Similarly, we derive for the differential:
\[ \mathrm{d} \ln \det(A) = \mathrm{Tr}[A^{-1} \mathrm{d}A] \tag{59} \]

2.9 Vector Function and Vector Variable

The above sections show how to deal with scalar functions. In order to deal with vector functions, we should restrict our attention to vector variables. It is not surprising that tractable forms in matrix calculus are so scarce: if we allowed matrix functions of matrix variables, the full specification of all partial derivatives would call for a tensor, which is difficult to visualize on paper. An alternative is to stretch the functions and variables such that they appear as vectors.

An annoying fact about matrix calculus is that, when you try to find reference materials, there are always two kinds of people: one group calculates the transpose of what the other calculates. Many online resources are not coherent with each other, which misleads people.

We borrow the following definitions of the Hessian matrix [7] and the Jacobian matrix [8] from Wikipedia:
\[ H(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & & & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \tag{60} \]
\[ J = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \tag{61} \]

Note two things about the second-order derivative:

- By writing the abbreviation \partial^2 f / (\partial x_2 \partial x_1), people mean \frac{\partial}{\partial x_2}\left(\frac{\partial f}{\partial x_1}\right) by convention. That is, first take the derivative with respect to x_1 and then take the derivative with respect to x_2.
- The Hessian matrix can be regarded as follows: first compute \partial f / \partial x (using Definition 1 to organize the elements), and then compute a vector-function-to-vector-variable derivative treating \partial f / \partial x as the function.

Bearing this in mind, we find that the Hessian matrix and the Jacobian matrix actually follow contradictory notions of organization. In order to be coherent in this document, we adopt the Hessian style. That is, each row corresponds to a variable, and each column corresponds to a function. To be concrete:

Definition 4. For a vector function f = [f_1, f_2, ..., f_n]^T, where f_i = f_i(x) and x = [x_1, x_2, ..., x_m]^T, we have the following definition:
\[ \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_n}{\partial x_1} \\ \frac{\partial f_1}{\partial x_2} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_n}{\partial x_2} \\ \vdots & & & \vdots \\ \frac{\partial f_1}{\partial x_m} & \frac{\partial f_2}{\partial x_m} & \cdots & \frac{\partial f_n}{\partial x_m} \end{bmatrix} \tag{62} \]

Example 6. According to Definition 4, we revisit the definitions of the Hessian and the Jacobian.

Given a twice-differentiable function f(x), the Hessian is defined as:
\[ H(f) = \frac{\partial^2 f}{\partial x\, \partial x^T} \tag{63} \]
Given two variables x and y, if we want to transform x into y, the Jacobian is defined as:
\[ J = \det\left( \left( \frac{\partial x}{\partial y} \right)^T \right) = \det\left( \frac{\partial x}{\partial y} \right) \tag{64} \]
Note that "Jacobian" is the shorthand name of "Jacobian determinant", which is the determinant of the "Jacobian matrix". Due to the transpose invariance of the determinant, the second equality shows that it does not matter which organization method we use if we only want to compute the Jacobian, rather than the Jacobian matrix. However, if we are to write out the Jacobian matrix, this may be a pitfall, depending on which organization of the vector-to-vector derivative we define.
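The layout pitfall is easy to demonstrate numerically. In the sketch below (assuming numpy), f(x) = Mx is a linear map; organizing its finite-difference partial derivatives per Definition 4 (rows index variables, columns index functions) yields M^T, whereas the Wikipedia-style Jacobian matrix of the same map is M:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 5))          # f(x) = M x maps R^5 -> R^3
x = rng.standard_normal(5)
eps = 1e-6

# partial derivatives organized per Definition 4: row i holds df_1/dx_i, ..., df_3/dx_i
D = np.zeros((5, 3))
for i in range(5):
    dx = eps * np.eye(5)[i]
    D[i, :] = (M @ (x + dx) - M @ (x - dx)) / (2 * eps)

print(np.allclose(D, M.T))   # Definition 4 gives M^T; the Jacobian-matrix convention gives M
```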

2.10 Vector Function Differential

In the previous sections, we related the matrix differential to the matrix derivative using Theorem 6. This theorem bridges the two quantities using the trace, so we can handle a problem in whichever form is preferable, interchangeably. (As we have seen, calculating the derivative of the determinant is more direct, while calculating the differential of the inverse is tractable.)

In this section, we provide another theorem to relate the vector function differential to vector-to-vector derivatives. Amazingly, it takes a cleaner form.

Theorem 9. With Definition 4, we have:
\[ \mathrm{d}f = \left( \frac{\partial f}{\partial x} \right)^T \mathrm{d}x \tag{65} \]

Proof. Apparently, df has n components, so we prove the claim element by element. Consider the j-th component:
\[ \mathrm{LHS}_j = \mathrm{d}f_j = \sum_{i=1}^m \frac{\partial f_j}{\partial x_i} \mathrm{d}x_i \tag{66} \]
\[ \mathrm{RHS}_j = \left( \left( \frac{\partial f}{\partial x} \right)^T \mathrm{d}x \right)_j = \sum_{i=1}^m \left( \left( \frac{\partial f}{\partial x} \right)^T \right)_{ji} \mathrm{d}x_i \tag{67-68} \]
\[ = \sum_{i=1}^m \left( \frac{\partial f}{\partial x} \right)_{ij} \mathrm{d}x_i \tag{69-70} \]
\[ = \sum_{i=1}^m \frac{\partial f_j}{\partial x_i} \mathrm{d}x_i = \mathrm{LHS}_j \tag{71} \]

Note that the trace operator is gone compared with Theorem 6, due to the nice way matrix-vector multiplication is defined. We could state a similar schema for handling vector-to-vector derivatives using this theorem; we do not bother to list the schema again and instead provide an example.

Example 7. Consider the variable transformation x = \sigma \Lambda^{-0.5} W^T \xi, where \sigma is a real value, \Lambda is a full-rank diagonal matrix, and W is an orthonormal square matrix (namely W W^T = W^T W = I). Compute the absolute value of the Jacobian.

First, we want to find \partial x / \partial \xi. This can easily be done by computing the differential:
\[ \mathrm{d}x = \mathrm{d}(\sigma \Lambda^{-0.5} W^T \xi) = \sigma \Lambda^{-0.5} W^T \mathrm{d}\xi \tag{72} \]
Applying Theorem 9, we have:
\[ \frac{\partial x}{\partial \xi} = (\sigma \Lambda^{-0.5} W^T)^T \tag{73} \]
Thus,
\[ J_m = \left( \frac{\partial x}{\partial \xi} \right)^T = ((\sigma \Lambda^{-0.5} W^T)^T)^T = \sigma \Lambda^{-0.5} W^T \tag{74} \]
where J_m is the Jacobian matrix and J = det(J_m). Then we use some properties of the determinant to calculate the absolute value of the Jacobian:
\[ |J| = |\det(J_m)| \tag{75} \]
\[ = \sqrt{\det(J_m)\det(J_m)} \tag{76} \]
\[ = \sqrt{\det(J_m^T)\det(J_m)} \tag{77} \]
\[ = \sqrt{\det(J_m^T J_m)} \tag{78-79} \]
\[ = \sqrt{\det(W \Lambda^{-0.5} \sigma\, \sigma \Lambda^{-0.5} W^T)} \tag{80} \]
\[ = \sqrt{\det(\sigma^2 W \Lambda^{-1} W^T)} \tag{81} \]
If the dimension of x and \xi is d and we define \Sigma = W \Lambda W^T, a nice result follows:
\[ |J| = \sigma^d \det(\Sigma)^{-1/2} \tag{82} \]
of which we will see an application when generalizing the multivariate Gaussian distribution.

Theorem 10. When f and x are of the same size, with Definition 4:
\[ \left( \frac{\partial f}{\partial x} \right)^{-1} = \frac{\partial x}{\partial f} \tag{83} \]

Proof. Using Theorem 9:
\[ \mathrm{d}f = \left( \frac{\partial f}{\partial x} \right)^T \mathrm{d}x \tag{84} \]
\[ \left( \left( \frac{\partial f}{\partial x} \right)^T \right)^{-1} \mathrm{d}f = \mathrm{d}x \tag{85-86} \]
\[ \mathrm{d}x = \left( \left( \frac{\partial f}{\partial x} \right)^{-1} \right)^T \mathrm{d}f \tag{87} \]

Using Theorem 9 in the reverse direction:
\[ \frac{\partial x}{\partial f} = \left( \frac{\partial f}{\partial x} \right)^{-1} \tag{88} \]

This result is consistent with the scalar derivative. It is useful when manipulating variable substitutions (the inverse of the Jacobian is the Jacobian of the reverse substitution).

2.11 Chain Rule

In the schemas concluded above, the differential is convenient for many problems. For the derivative, the nice aspect is that we have a chain rule, which is analogous to the chain rule in ordinary calculus. However, one should be very careful when applying this chain rule, since matrix multiplication requires dimension agreement.

Theorem 11. Suppose we have n column vectors x^{(1)}, x^{(2)}, ..., x^{(n)}, of lengths l_1, l_2, ..., l_n respectively. We know that x^{(i)} is a function of x^{(i-1)}, for all i = 2, 3, ..., n. The following relationship holds:
\[ \frac{\partial x^{(n)}}{\partial x^{(1)}} = \frac{\partial x^{(2)}}{\partial x^{(1)}} \frac{\partial x^{(3)}}{\partial x^{(2)}} \cdots \frac{\partial x^{(n)}}{\partial x^{(n-1)}} \tag{89} \]

Proof. Under Definition 4, Theorem 9 holds. We apply this theorem to each pair of consecutive vectors:
\[ \mathrm{d}x^{(2)} = \left( \frac{\partial x^{(2)}}{\partial x^{(1)}} \right)^T \mathrm{d}x^{(1)} \tag{90} \]
\[ \mathrm{d}x^{(3)} = \left( \frac{\partial x^{(3)}}{\partial x^{(2)}} \right)^T \mathrm{d}x^{(2)} \tag{91} \]
\[ \vdots \tag{92} \]
\[ \mathrm{d}x^{(n)} = \left( \frac{\partial x^{(n)}}{\partial x^{(n-1)}} \right)^T \mathrm{d}x^{(n-1)} \tag{93} \]
Plugging each expression into the next one, we have:
\[ \mathrm{d}x^{(n)} = \left( \frac{\partial x^{(n)}}{\partial x^{(n-1)}} \right)^T \cdots \left( \frac{\partial x^{(3)}}{\partial x^{(2)}} \right)^T \left( \frac{\partial x^{(2)}}{\partial x^{(1)}} \right)^T \mathrm{d}x^{(1)} \tag{94} \]
\[ = \left( \frac{\partial x^{(2)}}{\partial x^{(1)}} \frac{\partial x^{(3)}}{\partial x^{(2)}} \cdots \frac{\partial x^{(n)}}{\partial x^{(n-1)}} \right)^T \mathrm{d}x^{(1)} \tag{95} \]
Applying Theorem 9 again in the reverse direction, we have:
\[ \frac{\partial x^{(n)}}{\partial x^{(1)}} = \frac{\partial x^{(2)}}{\partial x^{(1)}} \frac{\partial x^{(3)}}{\partial x^{(2)}} \cdots \frac{\partial x^{(n)}}{\partial x^{(n-1)}} \tag{96} \]
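The ordering in Theorem 11 (and the role of Definition 4) can be checked on a small composite function. A sketch assuming numpy; the particular maps x^(2) = tanh(B x^(1)) and x^(3) = C x^(2) are our own illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3))          # x2 = tanh(B x1),  x1 in R^3, x2 in R^4
C = rng.standard_normal((2, 4))          # x3 = C x2,        x3 in R^2
x1 = rng.standard_normal(3)
eps = 1e-6

compose = lambda x1: C @ np.tanh(B @ x1)

# Definition-4 derivative of the composite, by finite differences: shape (3, 2)
D_num = np.zeros((3, 2))
for i in range(3):
    dx = eps * np.eye(3)[i]
    D_num[i, :] = (compose(x1 + dx) - compose(x1 - dx)) / (2 * eps)

# Theorem 11: multiply the per-step derivatives, each organized per Definition 4
s = 1.0 - np.tanh(B @ x1) ** 2           # elementwise derivative of tanh at B x1
dx2_dx1 = B.T @ np.diag(s)               # shape (3, 4)
dx3_dx2 = C.T                            # shape (4, 2)
print(np.allclose(D_num, dx2_dx1 @ dx3_dx2, atol=1e-6))
```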

Proposition 12. Consider a chain where x is a scalar, y is a column vector, and z is a scalar, with z = z(y) and y_i = y_i(x), i = 1, 2, ..., n. Applying the chain rule, we have:
\[ \frac{\partial z}{\partial x} = \frac{\partial y}{\partial x} \frac{\partial z}{\partial y} = \sum_{i=1}^n \frac{\partial y_i}{\partial x} \frac{\partial z}{\partial y_i} \tag{97} \]
Now we explain the intuition behind this. x and z are both scalars, so we are basically calculating a derivative in ordinary calculus. Besides, we have a group of "bridge variables" y_i. Each term (\partial y_i / \partial x)(\partial z / \partial y_i) is just the result of applying the scalar chain rule on the chain x -> y_i -> z. The separate results of the different bridge variables are additive! To see why I make this proposition, interested readers can refer to the comments in the corresponding LaTeX source file.

Example 8. Show the derivative of (x - \mu)^T \Sigma^{-1} (x - \mu) with respect to \mu (for symmetric \Sigma^{-1}).
\[ \frac{\partial [(x-\mu)^T \Sigma^{-1} (x-\mu)]}{\partial \mu} = \frac{\partial [x-\mu]}{\partial \mu} \cdot \frac{\partial [(x-\mu)^T \Sigma^{-1} (x-\mu)]}{\partial [x-\mu]} \tag{98-99} \]
\[ = \frac{\partial [x-\mu]}{\partial \mu} \cdot 2\Sigma^{-1}(x-\mu) \quad (\text{Example 4}) \tag{100} \]
\[ = -I \cdot 2\Sigma^{-1}(x-\mu) \quad (\mathrm{d}[x-\mu] = -I\, \mathrm{d}\mu) \tag{101} \]
\[ = -2\Sigma^{-1}(x-\mu) \tag{102} \]

3 Application

3.1 The 2nd Induced Norm of Matrix

The induced norm of a matrix is defined as [4]:
\[ \|A\|_p = \max_x \frac{\|Ax\|_p}{\|x\|_p} \tag{103} \]
where \|\cdot\|_p denotes the p-norm of vectors. Now we solve it for p = 2. (By default, \|\cdot\| means \|\cdot\|_2.)

The problem can be restated as:
\[ \|A\|_2^2 = \max_x \frac{\|Ax\|_2^2}{\|x\|_2^2} \tag{104} \]
since all the quantities involved are non-negative. Then we consider a scaling of the vector, x' = tx; thus:
\[ \|A\|_2^2 = \max_{x'} \frac{\|Ax'\|_2^2}{\|x'\|_2^2} = \max_x \frac{\|tAx\|_2^2}{\|tx\|_2^2} = \max_x \frac{t^2\|Ax\|_2^2}{t^2\|x\|_2^2} = \max_x \frac{\|Ax\|_2^2}{\|x\|_2^2} \tag{105} \]

This shows the invariance under scaling. Now we can restrict our attention to those x with \|x\| = 1, and reach the following formulation:
\[ \text{Maximize} \quad f(x) = \|Ax\|_2^2 \tag{106} \]
\[ \text{s.t.} \quad \|x\| = 1 \tag{107} \]
The standard way to handle this constrained optimization is to use Lagrange relaxation:
\[ L(x) = f(x) - \lambda (\|x\|^2 - 1) \tag{108} \]
Then we apply the general schema for scalar functions on L(x). First take the differential:
\[ \mathrm{d}L(x) = \mathrm{d}\,\mathrm{Tr}[L(x)] = \mathrm{Tr}[\mathrm{d}(L(x))] \tag{109-110} \]
\[ = \mathrm{Tr}[\mathrm{d}(x^T A^T A x - \lambda(x^T x - 1))] \tag{111} \]
\[ = \mathrm{Tr}[2 x^T A^T A\, \mathrm{d}x - \lambda(2 x^T \mathrm{d}x)] \tag{112} \]
\[ = \mathrm{Tr}[(2 x^T A^T A - 2\lambda x^T)\, \mathrm{d}x] \tag{113} \]
Next write out the derivative:
\[ \frac{\partial L}{\partial x} = 2 A^T A x - 2 \lambda x \tag{114} \]
Letting \partial L / \partial x = 0, we have:
\[ (A^T A) x = \lambda x \tag{115} \]
That means x is an eigenvector of A^T A (normalized so that \|x\| = 1), and \lambda is the corresponding eigenvalue. We plug this result back into the objective function:
\[ f(x) = x^T (A^T A x) = x^T (\lambda x) = \lambda \tag{116} \]
which means that, to maximize f(x), we should pick the maximum eigenvalue:
\[ \|A\|_2^2 = \max f(x) = \lambda_{\max}(A^T A) \tag{117} \]
That is:
\[ \|A\|_2 = \sqrt{\lambda_{\max}(A^T A)} = \sigma_{\max}(A) \tag{118} \]
where \sigma_{\max} denotes the maximum singular value. If A is real symmetric, \sigma_{\max}(A) equals the largest eigenvalue of A in absolute value, |\lambda|_{\max}(A).

Now we consider a real symmetric A and check whether:
\[ |\lambda|_{\max}^2(A) = \max_x \frac{\|Ax\|^2}{\|x\|^2} = \max_x \frac{x^T A^T A x}{x^T x} \tag{119} \]
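Before the proof, eqns (117)-(118) can be checked numerically. A minimal sketch assuming numpy (np.linalg.norm(A, 2) returns the spectral norm of a matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# eqns (117)-(118): ||A||_2 = sqrt(lambda_max(A^T A)) = sigma_max(A)
lam_max = np.max(np.linalg.eigvalsh(A.T @ A))
print(np.isclose(np.sqrt(lam_max), np.linalg.norm(A, 2)))

# brute-force check of the definition over random unit vectors (approaches ||A||_2 from below)
xs = rng.standard_normal((3, 100000))
xs /= np.linalg.norm(xs, axis=0)
print(np.max(np.linalg.norm(A @ xs, axis=0)), np.linalg.norm(A, 2))
```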

Proof. Since A^T A is real symmetric, it has an orthonormal basis formed by n eigenvectors v_1, v_2, ..., v_n, with eigenvalues \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n. We can write x = \sum_i c_i v_i, where c_i = \langle x, v_i \rangle. Then,
\[ \frac{x^T A^T A x}{x^T x} = \frac{\sum_i \lambda_i c_i^2}{\sum_i c_i^2} \quad (v_k \text{ is an orthonormal set}) \tag{120-121} \]
\[ \le \frac{\sum_i \lambda_1 c_i^2}{\sum_i c_i^2} = \lambda_1 \tag{122-123} \]
Now we have proved an upper bound for \|A\|_2^2. We show this bound is achievable by assigning x = v_1.

3.2 General Multivariate Gaussian Distribution

The first time I came across the multivariate Gaussian distribution was in my sophomore year. However, at that time, the multivariate version was optional, and the textbook only gave us the formula rather than explaining any intuition behind it. I had trouble remembering the formula, since I did not know why it was the case.

During the Machine Learning course [12] that I took recently, the rationale became clear to me. We'll start from the basic notion, generalize it to the multivariate isotropic Gaussian, and then use matrix calculus to derive the most general version. The following content is adapted from the corresponding course notes and homework exercises.

We start from the univariate Gaussian [9]:
\[ G(x \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{124} \]
The intuition behind it is:

- Use one centroid to represent a set of samples coming from the same Gaussian mixture; the centroid is denoted by \mu.
- Allow the existence of error. We express the reverse notion of "error" by "likelihood". We want the density function to describe a distribution in which the farther away a sample is, the less likely it is to be drawn from the mixture.
- We want this likelihood to decrease in an accelerated manner. A negative exponential is one proper candidate, say exp{-D(x, \mu)}, where D measures the distance between sample and centroid.
- To fit human intuition, the "best" choice of distance measure is the Euclidean distance, since we live in a Euclidean space. So far, we have exp{-(x-\mu)^2}.

- How fast the likelihood decreases should be controlled by a parameter, say exp{-(x-\mu)^2 / \sigma}. \sigma can also be thought of as controlling the uncertainty.

Now we already have the Gaussian distribution. The division by 2 in the exponent is only there to simplify deductions. Writing \sigma^2 instead of \sigma is to align with some statistical quantities. The remaining term is basically used for normalization, which can be derived by extending the 1-D Gaussian integral to a 2-D area integral using Fubini's theorem [11]. Interested readers can refer to [10].

Now we are ready to extend the univariate version to the multivariate version. The above four characteristics appear as a simple interpretation to everyone who has learned the Gaussian distribution before; however, they are the real "axioms" behind it. Now consider an isotropic Gaussian distribution, and start with perpendicular axes. The uncertainty along each axis is the same, so the distance measure is now \|x - \mu\|_2^2 = (x-\mu)^T (x-\mu), and the exponent is \exp\{-\frac{1}{2\sigma^2}(x-\mu)^T(x-\mu)\}. Integrating over the volume of x can be done by transforming the integral into iterated form using Fubini's theorem; then we simply apply the method used for the univariate Gaussian integral. Now we have the multivariate isotropic Gaussian distribution:
\[ G(x \mid \mu, \sigma^2) = \frac{1}{(2\pi)^{d/2} \sigma^d} \exp\left\{ -\frac{1}{2\sigma^2} (x-\mu)^T (x-\mu) \right\} \tag{125} \]
where d is the dimension of x and \mu.

We are just one step away from a general Gaussian. Suppose we have a general Gaussian, whose covariance is neither isotropic nor aligned with the standard Euclidean axes. Denote the sample from this Gaussian by \xi. We can first apply a rotation to \xi to bring it back to the standard axes, which can be done by left-multiplying an orthonormal matrix W^T. Then we scale each component of W^T \xi by a different ratio to make the components isotropic, which can be done by left-multiplying a diagonal matrix \Lambda^{-0.5}. The exponent -0.5 is simply to make the later discussion convenient. Finally, we multiply by \sigma to be able to control the uncertainty in each direction again. The transform is given by x = \sigma \Lambda^{-0.5} W^T \xi. Plugging this back into eqn (125), we derive the following exponent:
\[ \exp\left\{ -\frac{1}{2\sigma^2} (x-\mu)^T (x-\mu) \right\} \tag{126} \]
\[ = \exp\left\{ -\frac{1}{2\sigma^2} (\sigma \Lambda^{-0.5} W^T \xi - \mu)^T (\sigma \Lambda^{-0.5} W^T \xi - \mu) \right\} \tag{127} \]
\[ = \exp\left\{ -\frac{1}{2\sigma^2} (\xi - \mu_\xi)^T \sigma^2 W \Lambda^{-1} W^T (\xi - \mu_\xi) \right\} \tag{128} \]
\[ = \exp\left\{ -\frac{1}{2} (\xi - \mu_\xi)^T \Sigma^{-1} (\xi - \mu_\xi) \right\} \tag{129} \]
where we let \mu_\xi = E\xi and \Sigma = W \Lambda W^T, and we plugged in the following result:
\[ \mu = Ex = E[\sigma \Lambda^{-0.5} W^T \xi] = \sigma \Lambda^{-0.5} W^T E\xi = \sigma \Lambda^{-0.5} W^T \mu_\xi \tag{130} \]
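A numerical check of eqns (126)-(130), together with the Jacobian from eqn (82), assuming numpy (W is drawn as a random orthonormal matrix via QR; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 4, 1.7
W, _ = np.linalg.qr(rng.standard_normal((d, d)))       # random orthonormal W
lam = rng.uniform(0.5, 2.0, d)
Sigma = W @ np.diag(lam) @ W.T                          # Sigma = W Lam W^T

xi, mu_xi = rng.standard_normal(d), rng.standard_normal(d)
T = sigma * np.diag(lam ** -0.5) @ W.T                  # x = sigma Lam^{-0.5} W^T xi
x, mu = T @ xi, T @ mu_xi                               # eqn (130)

lhs = -(x - mu) @ (x - mu) / (2 * sigma ** 2)                     # isotropic exponent in x
rhs = -(xi - mu_xi) @ np.linalg.inv(Sigma) @ (xi - mu_xi) / 2     # eqn (129)
print(np.isclose(lhs, rhs))

# |J| = sigma^d det(Sigma)^{-1/2}, eqn (82)
print(np.isclose(abs(np.linalg.det(T)), sigma ** d * np.linalg.det(Sigma) ** -0.5))
```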

Now we only need to find the normalization factor to make it a probability distribution. Note that eqn (125) integrates to 1, which is:
\[ \int G(x \mid \mu, \sigma^2)\, \mathrm{d}x = \int \frac{1}{(2\pi)^{d/2} \sigma^d} \exp\left\{ -\frac{1}{2\sigma^2} (x-\mu)^T (x-\mu) \right\} \mathrm{d}x = 1 \tag{131} \]
Transforming the variable from x to \xi causes a scaling by the absolute value of the Jacobian, which we already calculated in Example 7. That is:
\[ \int \frac{1}{(2\pi)^{d/2} \sigma^d} \exp\left\{ -\frac{1}{2} (\xi - \mu_\xi)^T \Sigma^{-1} (\xi - \mu_\xi) \right\} |J|\, \mathrm{d}\xi = 1 \tag{132} \]
\[ \int \frac{\sigma^d \det(\Sigma)^{-1/2}}{(2\pi)^{d/2} \sigma^d} \exp\left\{ -\frac{1}{2} (\xi - \mu_\xi)^T \Sigma^{-1} (\xi - \mu_\xi) \right\} \mathrm{d}\xi = 1 \tag{133} \]
\[ \int \frac{1}{(2\pi)^{d/2} \det(\Sigma)^{1/2}} \exp\left\{ -\frac{1}{2} (\xi - \mu_\xi)^T \Sigma^{-1} (\xi - \mu_\xi) \right\} \mathrm{d}\xi = 1 \tag{134} \]
The term inside the integral is just the general Gaussian density we want to find:
\[ G(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \det(\Sigma)^{1/2}} \exp\left\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right\} \tag{135} \]

3.3 Maximum Likelihood Estimation of Gaussian

Given N samples x_t, t = 1, 2, ..., N, independently and identically distributed, drawn from the following Gaussian distribution:
\[ G(x_t \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} \det(\Sigma)^{1/2}} \exp\left\{ -\frac{1}{2} (x_t-\mu)^T \Sigma^{-1} (x_t-\mu) \right\} \tag{136} \]
solve for the parameters \theta = \{\mu, \Sigma\} that maximize:
\[ p(X \mid \theta) = \prod_{t=1}^N G(x_t \mid \theta) \tag{137} \]
It is more convenient to handle the log likelihood, as defined below:
\[ L = \ln p(X \mid \theta) = \sum_{t=1}^N \ln G(x_t \mid \theta) \tag{138} \]
We write each term of L out to facilitate further processing:
\[ L = -\frac{Nd}{2} \ln(2\pi) - \frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_t (x_t - \mu)^T \Sigma^{-1} (x_t - \mu) \tag{139} \]
Taking the derivative with respect to \mu, the first two terms vanish and the third term is already handled by Example 8:
\[ \frac{\partial L}{\partial \mu} = \sum_t \Sigma^{-1} (x_t - \mu) \tag{140} \]
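Eqn (140) can be spot-checked with finite differences before continuing. A minimal sketch assuming numpy (the data and the choice of Sigma below are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 50
X = rng.standard_normal((N, d))                 # rows are the samples x_t
Sigma = np.cov(X.T) + np.eye(d)                 # any symmetric positive definite Sigma will do
Sigma_inv = np.linalg.inv(Sigma)
mu = rng.standard_normal(d)

def loglik(mu):
    # eqn (139), dropping the constant -Nd/2 ln(2 pi) term
    diffs = X - mu
    return -N / 2 * np.log(np.linalg.det(Sigma)) - 0.5 * np.sum(diffs @ Sigma_inv * diffs)

eps = 1e-6
g_num = np.array([(loglik(mu + eps * np.eye(d)[k]) - loglik(mu - eps * np.eye(d)[k])) / (2 * eps)
                  for k in range(d)])
g_formula = Sigma_inv @ (X - mu).sum(axis=0)    # eqn (140)
print(np.allclose(g_num, g_formula, atol=1e-5))
```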

Letting \partial L / \partial \mu = 0, we solve for \mu = \frac{1}{N} \sum_t x_t. Then take the derivative with respect to \Sigma. It is easier to handle using our trace schema:
\[ \mathrm{d}L = \mathrm{d}\left[ -\frac{N}{2} \ln|\Sigma| \right] + \mathrm{d}\left[ -\frac{1}{2} \sum_t (x_t - \mu)^T \Sigma^{-1} (x_t - \mu) \right] \tag{141} \]
The first term is:
\[ \mathrm{d}\left[ -\frac{N}{2} \ln|\Sigma| \right] = -\frac{N}{2} \mathrm{d} \ln|\Sigma| = -\frac{N}{2} \mathrm{Tr}[\Sigma^{-1} \mathrm{d}\Sigma] \quad (\text{Section 2.8}) \tag{142-144} \]
The second term is:
\[ \mathrm{d}\left[ -\frac{1}{2} \sum_t (x_t - \mu)^T \Sigma^{-1} (x_t - \mu) \right] = -\frac{1}{2} \mathrm{d}\,\mathrm{Tr}\left[ \sum_t (x_t - \mu)^T \Sigma^{-1} (x_t - \mu) \right] \tag{145-146} \]
\[ = -\frac{1}{2} \mathrm{d}\,\mathrm{Tr}\left[ \sum_t (x_t - \mu)(x_t - \mu)^T \Sigma^{-1} \right] \tag{147} \]
\[ = -\frac{1}{2} \mathrm{Tr}\left[ \sum_t (x_t - \mu)(x_t - \mu)^T (-\Sigma^{-1} \mathrm{d}\Sigma\, \Sigma^{-1}) \right] \quad (\text{Example 5}) \tag{148} \]
\[ = \frac{1}{2} \mathrm{Tr}\left[ \Sigma^{-1} \sum_t (x_t - \mu)(x_t - \mu)^T \Sigma^{-1} \mathrm{d}\Sigma \right] \tag{149} \]
Then we have:
\[ \frac{\partial L}{\partial \Sigma} = -\frac{N}{2} \Sigma^{-1} + \frac{1}{2} \Sigma^{-1} \left[ \sum_t (x_t - \mu)(x_t - \mu)^T \right] \Sigma^{-1} \tag{150} \]
Letting \partial L / \partial \Sigma = 0, we solve for \Sigma = \frac{1}{N} \sum_t (x_t - \mu)(x_t - \mu)^T.

3.4 Least Square Error Inference: a Comparison

This application is rather simple. The purpose is to compare the derivation with and without matrix calculus.

3.4.1 Problem Definition

Here we consider a simple version:[1]

[1] Sid Jaggi, Dec 2011, Final Exam Q1 of CUHK ERG2013

Given the linear system with noise y = Ax + n, where x is the input, y is the output, and n is the noise term. A, the system parameter, is a known p-by-q matrix. Now we have one observation of the output, \hat{y}. We want to infer the "most possible"[2] corresponding input \hat{x}, defined by the following formula:
\[ \hat{x} = \arg\min_{x :\, \hat{y} = Ax + n} \|n\|_2^2 \tag{151} \]

3.4.2 Ordinary Formulation and Derivation

First, we write the error function explicitly:
\[ f(x) = \|\hat{y} - Ax\|^2 \tag{152} \]
\[ = \sum_{i=1}^p (\hat{y}_i - (Ax)_i)^2 \tag{153} \]
\[ = \sum_{i=1}^p \Big( \hat{y}_i - \sum_{j=1}^q A_{ij} x_j \Big)^2 \tag{154} \]
This is a quadratic form in all the x_k, k = 1, 2, ..., q. We take the derivative with respect to x_k:
\[ \frac{\partial f}{\partial x_k} = \sum_{i=1}^p \Big[ 2 \Big( \hat{y}_i - \sum_j A_{ij} x_j \Big) (-A_{ik}) \Big] \tag{155} \]
\[ = -2 \sum_{i=1}^p [(\hat{y}_i - (Ax)_i) A_{ik}] \tag{156} \]
\[ = -2 \langle \hat{y} - Ax,\, A_{\cdot k} \rangle = -2 A_{\cdot k}^T (\hat{y} - Ax) \tag{157-158} \]
where A_{\cdot k} is the k-th column of A and \langle \cdot, \cdot \rangle denotes the inner product of two vectors. We let all \partial f / \partial x_k = 0 and put the q equations together to get the matrix form:
\[ \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_q} \end{bmatrix} = \begin{bmatrix} -2 A_{\cdot 1}^T (\hat{y} - Ax) \\ -2 A_{\cdot 2}^T (\hat{y} - Ax) \\ \vdots \\ -2 A_{\cdot q}^T (\hat{y} - Ax) \end{bmatrix} = -2 A^T (\hat{y} - Ax) = 0 \tag{159} \]

[2] Assuming Gaussian noise, the maximum likelihood inference results in the same expression as least squares. For sim...
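Eqn (159) is exactly the normal equations A^T A x = A^T \hat{y}. A small numerical sketch assuming numpy, comparing the solution of the normal equations against np.linalg.lstsq:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 8, 3
A = rng.standard_normal((p, q))
y_hat = rng.standard_normal(p)

# eqn (159): A^T (y_hat - A x) = 0  =>  (A^T A) x = A^T y_hat
x_hat = np.linalg.solve(A.T @ A, A.T @ y_hat)
x_lstsq, *_ = np.linalg.lstsq(A, y_hat, rcond=None)
print(np.allclose(x_hat, x_lstsq))
```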
