Numerically Efficient Methods For Solving Least Squares Problems

Transcription

NUMERICALLY EFFICIENT METHODS FOR SOLVING LEAST SQUARES PROBLEMS

DO Q LEE

Abstract. Computing the solution to Least Squares Problems is of great importance in a wide range of fields ranging from numerical linear algebra to econometrics and optimization. This paper aims to present numerically stable and computationally efficient algorithms for computing the solution to Least Squares Problems. In order to evaluate and compare the stability and efficiency of our proposed algorithms, the theoretical complexities and numerical results have been analyzed.

Contents

1. Introduction: The Least Squares Problem
2. Existence and Uniqueness
2.1. Least Squares Solution from Normal Equations
3. Norms and Conditioning
3.1. Norms: Quantifying Error and Distance
3.2. Sensitivity and Conditioning: Perturbations to b
3.3. Sensitivity to perturbations in A
4. Normal Equations Method
4.1. Cholesky Factorization
4.2. Flop: Complexity of Numerical Algorithms
4.3. Algorithm: Computing the Cholesky Factorization
4.4. Shortcomings of Normal Equations
5. Orthogonal Methods - The QR Factorization
5.1. Orthogonal Matrices
5.2. Triangular Least Squares Problems
5.3. The QR Factorization in Least Squares Problems
5.4. Calculating the QR-factorization - Householder Transformations
5.5. Rank Deficiency: Numerical Loss of Orthogonality
6. Singular Value Decomposition (SVD)
6.1. The Minimum Norm Solution using SVD
6.2. Computing the SVD of Matrix A
7. Comparison of Methods
8. Acknowledgements
References

Date: August 24, 2012.

1. Introduction: The Least Squares Problem

Suppose we are given a set of observed values β and α_1, ..., α_n. Suppose the variable β is believed to have a linear dependence on the variables α_1, ..., α_n. Then we may postulate a linear model

    β = x_1 α_1 + ... + x_n α_n.

Our goal is to determine the unknown coefficients x_1, ..., x_n so that the linear model is a best fit to our observed data. Now consider a system of m linear equations with n variables:

    a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1
    a_21 x_1 + a_22 x_2 + ... + a_2n x_n = b_2
    ...
    a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n = b_m

Then we obtain an overdetermined linear system Ax = b, with an m × n matrix A, where m > n. Since equality is usually not exactly satisfiable when m > n, the Least Squares Solution x minimizes the squared Euclidean norm of the residual vector r(x) = b - Ax, so that

(1.1)    min ||r(x)||_2^2 = min ||b - Ax||_2^2.

In this paper, numerically stable and computationally efficient algorithms for solving Least Squares Problems will be considered. Sections 2 and 3 will introduce the tools of orthogonality, norms, and conditioning which are necessary for understanding the numerical algorithms introduced in the following sections. The Normal Equations Method using Cholesky Factorization will be discussed in detail in Section 4. In Section 5, we will see that the QR Factorization generates a more accurate solution than the Normal Equations Method, making it one of the most important tools in computing Least Squares Solutions. Section 6 will discuss the Singular Value Decomposition (SVD) and its robustness in solving rank-deficient problems. Finally, we will see that under certain circumstances the Normal Equations Method and the SVD may be more applicable than the QR approach.

2. Existence and Uniqueness

In this section, we will see that the linear Least Squares Problem Ax = b always has a solution, and this solution is unique if and only if the columns of A are linearly independent, i.e., rank(A) = n, where A is an m × n matrix. If rank(A) < n, then A is rank-deficient, and the solution is not unique. As such, the numerical algorithms in the later sections of this paper will be based on the assumption that A has full column rank n.

Definition 2.1. Let S ⊆ R^n. The orthogonal complement of S, denoted S^⊥, is the set of all vectors x ∈ R^n that are orthogonal to S.

One important property of orthogonal complements is the following:

    R^n = V ⊕ V^⊥,

where ⊕ is the direct sum, which means that any vector x ∈ R^n can be uniquely represented as

    x = p + o,

where p ∈ V and o ∈ V^⊥. Then p is called the orthogonal projection of the vector x onto the subspace V. As a result, we have the following lemma:

Lemma 2.2. Let V ⊆ R^n. Let p be the orthogonal projection of a vector x ∈ R^n onto V. Then ||x - v|| > ||x - p|| for any v ≠ p, v ∈ V.

Proof. Let o = x - p, o' = x - v, and v' = p - v. Then o' = o + v', v' ∈ V, and v' ≠ 0. Since o ∈ V^⊥, it follows that o · v' = 0. Thus,

    ||o'||^2 = o' · o' = (o + v') · (o + v') = o · o + v' · o + o · v' + v' · v' = o · o + v' · v' = ||o||^2 + ||v'||^2 > ||o||^2.

Thus ||x - p|| = min_{v ∈ V} ||x - v|| is the shortest distance from the vector x to V. □

2.1. Least Squares Solution from Normal Equations. Recall from (1.1) that the Least Squares Solution x minimizes ||r(x)||_2, where r(x) = b - Ax for x ∈ R^n. The dimension of span(A) is at most n, but if m > n, b generally does not lie in span(A), so there is no exact solution to the Least Squares Problem. In our next theorem, we will use Lemma 2.2 to see that Ax ∈ span(A) is closest to b when r = b - Ax is orthogonal to span(A), giving rise to the system of Normal Equations A^T A x = A^T b.

Theorem 2.3. Let A be an m × n matrix and b ∈ R^m. Then x̂ is a Least Squares Solution of the system Ax = b if and only if it is a solution of the associated normal system A^T A x = A^T b.

Proof. Let x ∈ R^n. Then Ax is an arbitrary vector in the column space of A, which we write as R(A). As a consequence of Lemma 2.2, r(x) = b - Ax is minimized when Ax is the orthogonal projection of b onto R(A). Since R(A)^⊥ = Null(A^T), x̂ is a Least Squares Solution if and only if

    A^T r(x̂) = A^T (b - A x̂) = 0,

which is equivalent to the system of Normal Equations

    A^T A x̂ = A^T b.  □

For this solution to be unique, the matrix A needs to have full column rank:

Theorem 2.4. Consider a system of linear equations Ax = b and the associated normal system A^T A x = A^T b. Then the following conditions are equivalent:
(1) The Least Squares Problem has a unique solution.
(2) The system Ax = 0 only has the zero solution.
(3) The columns of A are linearly independent.

Proof. Let x̂ be the unique Least Squares Solution and let x ∈ R^n be such that A^T A x = 0. Then

    A^T A (x̂ + x) = A^T A x̂ = A^T b.

Thus x̂ + x = x̂, i.e., x = 0, since the normal system has a unique solution. Also, this means that A^T A x = 0 only has the trivial solution, so A^T A is nonsingular (A^T A is nonsingular if and only if A^T A v ≠ 0 for all non-zero v ∈ R^n).

Now it is enough to prove that A^T A is invertible if and only if rank(A) = n. For v ∈ R^n, we have v^T (A^T A) v = (v^T A^T)(Av) = (Av)^T (Av) = ||Av||_2^2, so A^T A v ≠ 0

if and only if Av ≠ 0. For 1 ≤ i ≤ n, denote the i-th component of v by v_i and the i-th column of A by a_i. Then Av is a linear combination of the columns of A, the coefficients of which are the components of v. It follows that Av ≠ 0 for all non-zero v if and only if the n columns of A are linearly independent, i.e., A is full-rank. □

3. Norms and Conditioning

This section will first justify the choice of the Euclidean 2-norm from our introduction in (1.1). Then we will see how this choice of norm is closely linked to analyzing the sensitivity and conditioning of the Least Squares Problem.

3.1. Norms: Quantifying Error and Distance. A norm is a function that assigns a positive length to all the vectors in a vector space. The most common norms on R^n are

(1) the Euclidean norm: ||x||_2 = (Σ_{i=1}^n x_i^2)^{1/2},
(2) the p-norm: ||x||_p = (Σ_{i=1}^n |x_i|^p)^{1/p} for p ≥ 1, and
(3) the infinity norm: ||x||_∞ = max_{1≤i≤n} |x_i|.

Given so many choices of norm, it is natural to ask whether they are in any way comparable. It turns out that in finite dimensions, all norms are in fact equivalent to each other. The following theorem allows us to generalize Euclidean spaces to vector spaces of any finite dimension.

Theorem 3.1. If S is a vector space of finite dimension, then all norms are equivalent on S, i.e., for all pairs of norms || · ||_n, || · ||_m, there exist two constants c and C with 0 < c ≤ C such that, for all x ∈ S,

    c ||x||_n ≤ ||x||_m ≤ C ||x||_n.

Proof. It is enough to show that any two norms are equivalent on the unit sphere of a chosen norm, because for a general vector x ∈ R^n we can write x = γ x_0, where γ = ||x||_n and x_0 is a vector on the unit sphere. We first prove that any norm is a continuous function: suppose that x_0 ∈ R^n. Then for every x ∈ R^n, the triangle inequality gives | ||x|| - ||x_0|| | ≤ ||x - x_0||. Thus ||x|| → ||x_0|| whenever ||x - x_0|| → 0.

For simplicity, we work in the case || · ||_n = || · ||_2. For the second inequality, let {e_i} be the canonical basis in R^n. Then write x = Σ_i x_i e_i so that, by the Cauchy-Schwarz inequality,

    ||x||_m ≤ Σ_i |x_i| ||e_i||_m ≤ (Σ_i x_i^2)^{1/2} (Σ_i ||e_i||_m^2)^{1/2} = C ||x||_2.

For the first inequality, by using continuity on the unit sphere, it can be shown that the function x ↦ ||x||_m has a minimum value c > 0 on the unit sphere. Now write x = ||x||_2 x̂ so that

    ||x||_m = ||x||_2 ||x̂||_m ≥ c ||x||_2.  □

With the equivalence of norms, we are now able to choose the 2-norm as our tool for the Least Squares Problem: solutions in the 2-norm are equivalent to solutions in any other norm.
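To make the norm definitions above concrete, here is a small sketch in Python with NumPy (not part of the original paper) that evaluates the three norms for a sample vector and checks one instance of the equivalence bounds of Theorem 3.1, namely ||x||_∞ ≤ ||x||_2 ≤ sqrt(n) ||x||_∞.

    import numpy as np

    x = np.array([3.0, -4.0, 12.0])        # sample vector, n = 3
    norm_2 = np.linalg.norm(x, 2)          # Euclidean norm: 13.0
    norm_1 = np.linalg.norm(x, 1)          # 1-norm: 19.0
    norm_inf = np.linalg.norm(x, np.inf)   # infinity norm: 12.0

    # one instance of norm equivalence (Theorem 3.1):
    # ||x||_inf <= ||x||_2 <= sqrt(n) * ||x||_inf
    n = x.size
    assert norm_inf <= norm_2 <= np.sqrt(n) * norm_inf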

The 2-norm is the most convenient one for our purposes because it is associated with an inner product. Once we have an inner product defined on a vector space, we can define both a norm and a distance for the inner product space:

Definition 3.2. Suppose that V is an inner product space. The norm or length of a vector u ∈ V is defined as

    ||u|| = ⟨u, u⟩^{1/2}.

The equivalence of norms also implies that any properties of accuracy and stability are independent of the norm chosen to evaluate them. Now that we have a means of quantifying distance and length, we have the tools for quantifying the error in our Least Squares Solutions.

3.2. Sensitivity and Conditioning: Perturbations to b. In general, a non-square m × n matrix A has no inverse in the usual sense. But if A has full rank, a pseudoinverse can be defined as

(3.3)    A^+ = (A^T A)^{-1} A^T,

and the condition number κ(A) by

(3.4)    κ(A) = ||A|| · ||A^+||.

Combined with Theorem 2.3, the Least Squares Solution of Ax = b can be given by x = A^+ b. We will soon see that if κ(A) ≫ 1, small perturbations in A can lead to large errors in the solution.

Systems of linear equations are sensitive if a small perturbation in the matrix A or in the right-hand side b causes a significant change in the solution x. Such systems are called ill-conditioned. The following is an example of an ill-conditioned matrix with respect to perturbations on b.

Example 3.5. The linear system Ax = b, with

    A = [ 1    1   ]        b = [ 2 ]
        [ 1  1 + ε ],           [ 2 ],

where 0 < ε ≪ 1, has the solution x = (2, 0)^T. We claim this system is ill-conditioned because small perturbations in b alter the solution significantly. If we compute the system A x̂ = b̂, where

    b̂ = [   2   ]
        [ 2 + ε ],

we see this has the solution x̂ = (1, 1)^T, which is completely different from x.

One way to determine sensitivity to perturbations of b is to examine the relationship between the error and the residual. Consider the solution of the system Ax = b, with expected answer x and computed answer x̂. We will write the error e and the residual r as

    e = x - x̂,    r = b - A x̂ = b - b̂.

Since x may not be obtained immediately, the accuracy of the solution is often evaluated by looking at the residual

    r = b - A x̂ = Ax - A x̂ = Ae.

We take the norm of e to get a bound for the absolute error:

    ||e||_2 = ||x - x̂||_2 = ||A^+ (b - b̂)||_2 ≤ ||A^+||_2 ||b - b̂||_2 = ||A^+||_2 ||r||_2,

so that

    ||e||_2 ≤ ||A^+||_2 ||r||_2.

Using this we can derive a bound relating the relative error ||e||_2/||x||_2 to the relative residual ||r||_2/||b||_2:

    ||e||_2/||x||_2 ≤ ||A^+||_2 ||r||_2/||x||_2 = ||A^+||_2 (||Ax||_2/||x||_2)(||r||_2/||b||_2) ≤ ||A^+||_2 ||A||_2 ||r||_2/||b||_2,

since ||Ax||_2 = ||b||_2 and ||Ax||_2 ≤ ||A||_2 ||x||_2. Thus

(3.6)    ||e||_2/||x||_2 ≤ κ(A) ||r||_2/||b||_2.

If κ(A) is large (i.e., the matrix is ill-conditioned), then relatively small perturbations of the right-hand side b (and therefore of the residual) may lead to much larger errors. For well-conditioned problems (κ(A) ≈ 1) we can derive another useful bound:

    ||r||_2 ||x||_2 = ||Ae||_2 ||x||_2 = ||Ae||_2 ||A^+ b||_2 ≤ ||A||_2 ||e||_2 ||A^+||_2 ||b||_2,

so that

(3.7)    (1/κ(A)) ||r||_2/||b||_2 ≤ ||e||_2/||x||_2.

Finally, combine (3.6) and (3.7) to obtain

    (1/κ(A)) ||r||_2/||b||_2 ≤ ||x - x̂||_2/||x||_2 ≤ κ(A) ||r||_2/||b||_2.

These bounds hold for any A, but they show that the residual is a good indicator of the error only if A is well-conditioned. The closer κ(A) is to 1, the tighter the bound becomes. On the other hand, an ill-conditioned A allows for large variations in the relative error.

3.3. Sensitivity to perturbations in A. Small perturbations of ill-conditioned matrices can lead to large changes in the solution.

Example 3.8. For the linear system Ax = b, let

    A = [ 1    1   ]        ΔA = [ 0   0 ]
        [ 1  1 + ε ],            [ 0  -ε ],

with 0 < ε ≪ 1. Then the perturbed matrix

    Â = A + ΔA = [ 1  1 ]
                 [ 1  1 ]

is singular, so Â x̂ = b has no solution.

Consider the linear system Ax = b, but now A is perturbed to Â = A + ΔA. Denote by x the exact solution to Ax = b, and by x̂ the solution to the perturbed Least Squares Problem Â x̂ = b. Now we can write x̂ = x + Δx, so that

    Â x̂ = (A + ΔA)(x + Δx) = b,

and therefore

    Ax - b + (ΔA)x + A(Δx) + (ΔA)(Δx) = 0.

The term (ΔA)(Δx) is negligibly small compared to the other terms, so we get

    Δx ≈ -A^+ (ΔA) x.

Taking norms leads to the following result:

    ||Δx||_2 ≤ ||A^+||_2 ||ΔA||_2 ||x||_2 = ||A^+||_2 ||A||_2 (||ΔA||_2/||A||_2) ||x||_2,

or

    ||x - x̂||_2/||x||_2 ≤ κ(A) ||A - Â||_2/||A||_2.

4. Normal Equations Method

From Theorem 2.3, we have seen that finding the Least Squares Solution can be reduced to solving the system of Normal Equations A^T A x = A^T b. The Normal Equations Method computes the solution to the Least Squares Problem by transforming the rectangular matrix A into triangular form.

4.1. Cholesky Factorization. If A has full column rank, then the following holds for A^T A:

(1) A^T A is symmetric, since (A^T A)^T = A^T (A^T)^T = A^T A.
(2) A^T A is positive definite, since x^T A^T A x = (Ax)^T (Ax) = ||Ax||_2^2 > 0 if x ≠ 0.

Thus, if an m × n matrix A has rank n, then A^T A is n × n, symmetric, and positive definite. In this case it is favorable to use the Cholesky Factorization, which decomposes any symmetric positive definite matrix into the product of two triangular matrices, so that A^T A can be expressed as

    A^T A = L L^T,

where L is an n × n lower triangular matrix. See [2] for a detailed analysis.

4.2. Flop: Complexity of Numerical Algorithms. Triangular matrices are used extensively in numerical algorithms such as the Cholesky or the QR Factorization because triangular systems are among the simplest systems to solve. By deriving the flop count for triangular system solving, this section introduces flop counting as a method of evaluating the performance of an algorithm.

A flop is a floating point operation (+, -, ×, /). In an n × n unit lower triangular system Ly = b, each y_k in y is obtained by writing

    y_k = b_k - Σ_{j=1}^{k-1} l_{kj} y_j,

which requires k - 1 multiplications and k - 1 additions. Thus y requires n^2 - n flops to compute. Since n is usually sufficiently large to ignore lower order terms, we say that an n-by-n forward substitution costs n^2 flops.

When a linear system is solved by this algorithm, the arithmetic associated with solving the triangular system is often dominated by the arithmetic required for the factorization. We only worry about the leading-order behavior when counting flops, i.e., we assume m, n are large. Keeping this in mind, we will examine the Cholesky Factorization used for solving systems involving symmetric, positive definite matrices.
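As a small illustration of the triangular solves discussed above, the following sketch (Python with NumPy; not part of the original paper) implements forward substitution for a unit lower triangular system Ly = b, following the recurrence for y_k. The accumulated dot product inside the loop is where the roughly n^2 flops are spent.

    import numpy as np

    def forward_substitution_unit(L, b):
        """Solve Ly = b, where L is unit lower triangular (ones on the diagonal)."""
        n = b.shape[0]
        y = np.zeros(n)
        for k in range(n):
            # y_k = b_k - sum_{j<k} l_kj * y_j   (about 2(k-1) flops per step)
            y[k] = b[k] - L[k, :k] @ y[:k]
        return y

    # usage with an assumed 3 x 3 unit lower triangular matrix
    L = np.array([[1.0, 0.0, 0.0],
                  [2.0, 1.0, 0.0],
                  [0.5, 3.0, 1.0]])
    b = np.array([1.0, 4.0, 2.5])
    y = forward_substitution_unit(L, b)
    assert np.allclose(L @ y, b)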

4.3. Algorithm: Computing the Cholesky Factorization. For a matrix A, define A_{i:i',j:j'} to be the (i' - i + 1) × (j' - j + 1) submatrix of A with upper left corner a_{ij} and lower right corner a_{i'j'}.

Algorithm 4.1.
    R = A
    for k = 1 to m
        for j = k + 1 to m
            R_{j,j:m} = R_{j,j:m} - R_{k,j:m} R_{kj} / R_{kk}
        R_{k,k:m} = R_{k,k:m} / sqrt(R_{kk})

The fourth line dominates the operation count for this algorithm, so its flop count can be obtained by considering

    Σ_{k=1}^{m} Σ_{j=k+1}^{m} 2(m - j) ≈ 2 Σ_{k=1}^{m} Σ_{j=1}^{k} j ≈ Σ_{k=1}^{m} k^2 ≈ m^3/3.

Thus we have the following algorithm for the Normal Equations Method (a code sketch of these steps is given at the end of this section):

(1) Calculate C = A^T A (C is symmetric, so mn^2 flops).
(2) Cholesky Factorization C = L L^T (n^3/3 flops).
(3) Calculate d = A^T b (2mn flops).
(4) Solve Lz = d by forward substitution (n^2 flops).
(5) Solve L^T x = z by back substitution (n^2 flops).

This gives us the cost for large m, n: mn^2 + (1/3)n^3 flops.

4.4. Shortcomings of Normal Equations. The Normal Equations Method is much quicker than other algorithms but is in general more unstable.

Example 4.2. Consider the matrix

    A = [ 1  1 ]
        [ ε  0 ]
        [ 0  ε ],

where 0 < ε ≪ 1. Then for very small ε,

    A^T A = [ 1 + ε^2     1     ]  ≈  [ 1  1 ]
            [    1     1 + ε^2  ]     [ 1  1 ],

which is singular.

Conditioning of the system is also worsened: κ(A^T A) = [κ(A)]^2, so the Normal Equations have a relative sensitivity that is squared compared to the original Least Squares Problem Ax = b. Thus we can conclude that any numerical method using the Normal Equations will be unstable, since the rounding errors will correspond to [κ(A)]^2 instead of κ(A).
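The following is a minimal sketch (Python with NumPy; not part of the original paper) of the five-step Normal Equations Method of Section 4.3. It calls NumPy's built-in Cholesky routine rather than Algorithm 4.1, and it uses a general solver in place of the forward and back substitutions, so it illustrates the structure of the method rather than its exact flop count.

    import numpy as np

    def lstsq_normal_equations(A, b):
        """Least squares via the Normal Equations and Cholesky (steps 1-5 above)."""
        C = A.T @ A                    # step 1: C = A^T A   (~ m n^2 flops)
        L = np.linalg.cholesky(C)      # step 2: C = L L^T   (~ n^3/3 flops)
        d = A.T @ b                    # step 3: d = A^T b   (2 m n flops)
        # steps 4 and 5 would be forward and back substitution (~ n^2 flops each);
        # np.linalg.solve is used here only to keep the sketch short
        z = np.linalg.solve(L, d)      # L z = d
        x = np.linalg.solve(L.T, z)    # L^T x = z
        return x

    # usage on an assumed small overdetermined system
    rng = np.random.default_rng(0)
    A = rng.standard_normal((10, 3))
    b = rng.standard_normal(10)
    x = lstsq_normal_equations(A, b)
    assert np.allclose(A.T @ (b - A @ x), 0.0, atol=1e-10)   # Normal Equations residual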

5. Orthogonal Methods - The QR Factorization

The QR Factorization is an alternative approach that avoids the shortcomings of the Normal Equations. For a matrix A ∈ R^{m×n} with full rank n, let A = [a_1 a_2 ··· a_n]. We wish to produce a sequence of orthonormal vectors q_1, q_2, ... spanning the same space as the columns of A:

(5.1)    ⟨q_1, ..., q_j⟩ = ⟨a_1, ..., a_j⟩

for j = 1, 2, ..., n. This way, for k = 1, ..., n, each a_k can be expressed as a linear combination of q_1, ..., q_k. This is equivalent to the matrix form A = QR, where Q ∈ R^{m×n} has orthonormal columns and R ∈ R^{n×n} is upper triangular. To understand the factorization we will first introduce some special classes of matrices and their properties.

5.1. Orthogonal Matrices. A matrix Q ∈ R^{m×n} with m ≥ n has orthonormal columns if every column of Q is orthogonal to every other column and is normalized. When m = n, so that Q is a square matrix, we can define an orthogonal matrix:

Definition 5.2. A square matrix with orthonormal columns is referred to as an orthogonal matrix.

For an orthogonal matrix Q, it holds that Q^T Q = Q Q^T = I_n. From this it is clear that Q^{-1} = Q^T, and if Q is an orthogonal matrix, then Q^T is also orthogonal. Using these properties, we will prove the following lemma, which shows that multiplying a vector by an orthogonal matrix preserves its Euclidean norm:

Lemma 5.3. Multiplying a vector by an orthogonal matrix does not change its 2-norm. Thus if Q is orthogonal, it holds that

    ||Qx||_2 = ||x||_2.

Proof. ||Qx||_2^2 = (Qx)^T Qx = x^T Q^T Q x = x^T x = ||x||_2^2.  □

Norm preservation implies no amplification of numerical error. The greatest advantage of using orthogonal transformations is their numerical stability: if Q is orthogonal then κ(Q) = 1. It is also clear that multiplying both sides of the Least Squares Problem (1.1) by an orthogonal matrix does not change its solution.

5.2. Triangular Least Squares Problems. The upper triangular overdetermined Least Squares Problem can be written as

    [ R ] x ≈ [ b_1 ]
    [ O ]     [ b_2 ],

where R is an n × n upper triangular partition, the entries of O are all zero, and b = [b_1; b_2] is partitioned similarly. Then the residual becomes

    ||r||_2^2 = ||b_1 - Rx||_2^2 + ||b_2||_2^2.

Although ||b_2||_2^2 does not depend on x, the first term ||b_1 - Rx||_2^2 is minimized when x satisfies the n × n triangular system

    Rx = b_1,

which can be easily solved by back substitution. In this case, x is the Least Squares Solution with the minimum residual

    ||r||_2^2 = ||b_2||_2^2.

Recall that solving Least Squares Problems by Normal Equations squares the condition number, i.e., κ(A^T A) = [κ(A)]^2. Combined with Lemma 5.3, we can conclude that the QR approach enhances numerical stability by avoiding this squaring effect.
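The squaring effect just mentioned is easy to observe numerically. The following small sketch (Python with NumPy; the matrix is an assumed example, not taken from the paper) compares the 2-norm condition numbers of A and of the normal matrix A^T A.

    import numpy as np

    # an assumed, mildly ill-conditioned 4 x 2 matrix
    A = np.array([[1.0, 1.0],
                  [1.0, 1.0001],
                  [1.0, 0.9999],
                  [1.0, 1.0002]])

    kappa_A = np.linalg.cond(A)            # 2-norm condition number of A
    kappa_AtA = np.linalg.cond(A.T @ A)    # condition number of the normal matrix

    # kappa(A^T A) = kappa(A)^2, up to rounding error
    print(kappa_A, kappa_AtA, kappa_A ** 2)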

5.3. The QR Factorization in Least Squares Problems.

Theorem 5.4. Given the matrix A ∈ R^{m×n} and the right-hand side b ∈ R^m, the solution set of the Least Squares Problem

    min_{x ∈ R^n} ||b - Ax||_2

is identical to the solution set of

    Rx = Q^T b,

where the m × n matrix Q with orthonormal columns and the upper triangular n × n matrix R form a QR-factorization of A.

Proof. Given an m × n matrix A with m > n, we need to find an m × m orthogonal matrix Q such that

    A = Q [ R ]
          [ O ],

so that when applied to the Least Squares Problem, Ax ≈ b becomes equivalent to

(5.5)    Q^T A x = [ R ] x ≈ [ b̂_1 ] = Q^T b,
                   [ O ]     [ b̂_2 ]

where R is n × n and upper triangular. If we partition the m × m orthogonal matrix as Q = [Q_1 Q_2], where Q_1 is m × n, then

    A = Q [ R ] = [Q_1 Q_2] [ R ] = Q_1 R,
          [ O ]             [ O ]

which is called the reduced QR Factorization of A. Then the solution to the Least Squares Problem Ax ≈ b is given by the solution to the square system

    Q_1^T A x = Rx = b̂_1 = Q_1^T b.

So the minimum value of ||b - Ax||_2^2 is realized when ||Q_1^T b - Rx||_2^2 = 0, i.e., when Q_1^T b - Rx = 0.  □

Although we have shown that QR is an efficient, accurate way to solve Least Squares Problems, exhibiting the QR Factorization is quite a different matter. A standard method of obtaining a QR Factorization is via Gram-Schmidt orthogonalization. Although this can be implemented numerically, it turns out not to be as stable as QR via Householder reflectors, and thus we choose to focus on the latter method.

Finally, the Least Squares Problem can be solved by the following algorithm (a code sketch is given below):

(1) QR Factorization of A (Householder or Gram-Schmidt, 2mn^2 flops).
(2) Form d = Q^T b (2mn flops).
(3) Solve Rx = d by back substitution (n^2 flops).

This gives us the cost for large m, n: 2mn^2 flops.

5.4. Calculating the QR-factorization - Householder Transformations. This section gives a rough sketch of how to calculate the QR-factorization in a way that is numerically stable. The main idea is to multiply the matrix A by a sequence of simple orthogonal matrices Q_k in order to "shape" A into an upper triangular matrix. In the k-th step, the matrix Q_k introduces zeros below the diagonal in the k-th column while keeping the zeros already introduced in previous columns. By the end of the n-th step, all the entries below the diagonal become zero, making Q^T A = Q_n ··· Q_2 Q_1 A = R upper triangular.
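As a concrete illustration of the three-step algorithm listed above, here is a minimal sketch (Python with NumPy; not part of the original paper) that relies on the library's reduced QR factorization; the construction of the factorization itself continues below, and a hand-written Householder version is sketched at the end of Section 5.4.

    import numpy as np

    def lstsq_qr(A, b):
        """Least squares via the reduced QR factorization (steps 1-3 above)."""
        Q1, R = np.linalg.qr(A)   # step 1: reduced QR, Q1 is m x n, R is n x n
        d = Q1.T @ b              # step 2: d = Q1^T b
        # step 3: R x = d would be solved by back substitution;
        # np.linalg.solve is used here only to keep the sketch short
        x = np.linalg.solve(R, d)
        return x

    # usage on an assumed small overdetermined system
    rng = np.random.default_rng(2)
    A = rng.standard_normal((20, 4))
    b = rng.standard_normal(20)
    x = lstsq_qr(A, b)
    assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])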

In order to construct the matrix Q^T, choose each orthogonal matrix Q_k such that

    Q_k = [ I  0 ]
          [ 0  F ],

where I is a (k - 1) × (k - 1) identity matrix and F is an (m - k + 1) × (m - k + 1) matrix. The identity block in Q_k allows us to preserve the zeros already introduced in the first k - 1 columns, while F introduces zeros below the diagonal in the k-th column. Now let x ∈ R^{m-k+1}. We define F as the Householder reflector, chosen as

    F = I - 2 (v v^T)/(v^T v),

where v = ||x|| e_1 - x. This achieves

    F x = ( ||x||, 0, ..., 0 )^T = ||x|| e_1.

After A has been reduced to upper triangular form Q^T A = R, the orthogonal matrix Q is constructed by directly computing the formula

    Q^T = Q_n ··· Q_1,    Q = Q_1 ··· Q_n.

As this process is repeated n times, each step works on a smaller block of the matrix while deleting the entries below the diagonal in the next column. Each Householder transformation is applied to the entire remaining matrix, but it does not affect the prior columns, so the zeros are preserved.

In order to achieve the operation count, we take advantage of matrix-vector multiplications rather than full matrix-matrix multiplication. In other words, we perform

    (I - 2 (v v^T)/(v^T v)) A = A - 2 v (v^T A)/(v^T v),

which consists of a matrix-vector multiplication and an outer product. Compared to the Cholesky Factorization, applying the Householder transformation this way is much cheaper because it requires only the vector v, rather than the full matrix F. Repeating this process n times gives us the Householder algorithm:

Algorithm 5.6.
    for k = 1 to n
        v = A_{k:m,k}
        u_k = v - ||v|| e_1
        u_k = u_k / ||u_k||
        A_{k:m,k:n} = A_{k:m,k:n} - 2 u_k (u_k^T A_{k:m,k:n})

Most of the work is done by the last line in the loop, A_{k:m,k:n} = A_{k:m,k:n} - 2 u_k (u_k^T A_{k:m,k:n}), which costs 4(m - k + 1)(n - k + 1) flops. Thus we have

    Total flops = Σ_{k=1}^{n} 4(m - k + 1)(n - k + 1) ≈ 2mn^2 - 2n^3/3.
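A minimal Python/NumPy sketch of Algorithm 5.6 is given below (not part of the original paper). It returns R together with the Householder vectors u_k, which are then used to apply Q^T to b. The sign convention in the construction of u_k is a standard refinement, not stated in Algorithm 5.6, used to avoid cancellation.

    import numpy as np

    def householder_qr(A):
        """Reduce A (m x n, m >= n) to upper triangular form following Algorithm 5.6;
        return R and the list of Householder vectors u_k."""
        R = A.astype(float).copy()
        m, n = R.shape
        U = []
        for k in range(n):
            v = R[k:, k].copy()
            e1 = np.zeros_like(v)
            e1[0] = 1.0
            sign = 1.0 if v[0] >= 0 else -1.0    # sign choice avoids cancellation
            u = v + sign * np.linalg.norm(v) * e1
            u = u / np.linalg.norm(u)
            # A_{k:m,k:n} <- A_{k:m,k:n} - 2 u (u^T A_{k:m,k:n})
            R[k:, k:] -= 2.0 * np.outer(u, u @ R[k:, k:])
            U.append(u)
        return R, U

    def apply_Qt(U, b):
        """Compute Q^T b by applying the stored reflectors in order."""
        y = b.astype(float).copy()
        for k, u in enumerate(U):
            y[k:] -= 2.0 * u * (u @ y[k:])
        return y

    # usage: solve an assumed small least squares problem via R x = (Q^T b)_{1:n}
    rng = np.random.default_rng(1)
    A = rng.standard_normal((8, 3))
    b = rng.standard_normal(8)
    R, U = householder_qr(A)
    y = apply_Qt(U, b)
    x = np.linalg.solve(R[:3, :3], y[:3])    # back substitution in practice
    assert np.allclose(A.T @ (A @ x - b), 0.0, atol=1e-10)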

5.5. Rank Deficiency: Numerical Loss of Orthogonality. We have assumed so far that the matrix A has full rank. But if A is rank-deficient, the columns of A would be linearly dependent, so there would be some column a_j in A such that a_j ∈ span{q_1, ..., q_{j-1}} = span{a_1, ..., a_{j-1}}. In this case, from (5.1) we can see that the QR Factorization will fail. Not only that, when a matrix is close to rank-deficient, it can lose its orthogonality due to numerical error. Consider the 2 × 2 matrix

    A = [ 0.70000  0.70711 ]
        [ 0.70001  0.70711 ].

It is easy to check that A has full rank. But with a 5-digit accuracy of the problem, after computing the QR Factorization we see that

    Q = [ 0.70710  0.70711 ]
        [ 0.70711  0.70710 ]

is clearly not orthogonal. If such a Q is used to solve the Least Squares Problem, the system would be highly sensitive to perturbations. Nevertheless, a minimum norm solution can still be computed with the help of the Singular Value Decomposition (SVD), which will be covered in the next section.

6. Singular Value Decomposition (SVD)

Where the QR Factorization used orthogonal transformations to reduce the Least Squares Problem to a triangular system, the Singular Value Decomposition uses orthogonal transformations to reduce the problem to a diagonal system. We first introduce the Singular Value Decomposition.

Theorem 6.1. Let A be an arbitrary m × n matrix with m ≥ n. Then A can be factorized as A = U Σ V^T, where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal matrices, and Σ = diag(σ_1, ..., σ_n), where σ_1 ≥ ... ≥ σ_n ≥ 0.

Proof. Let σ_1 = ||A||_2. Choose v_1 ∈ R^n and u_1' ∈ R^m such that ||v_1||_2 = 1, ||u_1'||_2 = σ_1, and u_1' = A v_1. Then normalize u_1' as u_1 = u_1'/||u_1'||_2. To form the orthogonal matrices U_1 and V_1, extend v_1 and u_1 to some orthonormal bases {v_i} and {u_i}, which form the columns of each matrix. Then we have

(6.2)    U_1^T A V_1 = S = [ σ_1  w^T ]
                           [  0    B  ].

It then suffices to prove that w = 0. Consider the inequality

    ||S (σ_1; w)||_2 ≥ σ_1^2 + w^T w = (σ_1^2 + w^T w)^{1/2} ||(σ_1; w)||_2,

where (σ_1; w) denotes the stacked column vector; this gives us ||S||_2 ≥ (σ_1^2 + w^T w)^{1/2}. But since U_1 and V_1 are orthogonal, from (6.2) we have ||S||_2 = σ_1, which makes w = 0.

Now proceed by induction. If n = 1 or m = 1 the case is trivial. In other cases, the submatrix B has an SVD B = U_2 Σ_2 V_2^T by the induction hypothesis. Thus we obtain the SVD of A in the following form:

    A = U_1 [ 1   0  ] [ σ_1   0  ] [ 1   0  ]^T V_1^T.  □
            [ 0  U_2 ] [  0   Σ_2 ] [ 0  V_2 ]
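The factorization guaranteed by Theorem 6.1 is available directly in numerical libraries. The following small sketch (Python with NumPy; the matrix is an assumed example, not taken from the paper) computes an SVD and checks the properties stated in the theorem.

    import numpy as np

    # an assumed 4 x 2 example matrix
    A = np.array([[2.0, 0.0],
                  [0.0, -3.0],
                  [0.0, 0.0],
                  [1.0, 1.0]])

    U, s, Vt = np.linalg.svd(A)            # full SVD: U is 4 x 4, Vt is 2 x 2
    Sigma = np.zeros_like(A)
    Sigma[:2, :2] = np.diag(s)             # embed the singular values in a 4 x 2 Sigma

    assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)   # sigma_1 >= ... >= sigma_n >= 0
    assert np.allclose(U @ U.T, np.eye(4))              # U is orthogonal
    assert np.allclose(Vt @ Vt.T, np.eye(2))            # V is orthogonal
    assert np.allclose(U @ Sigma @ Vt, A)               # A = U Sigma V^T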

The main idea behind the SVD is that any matrix can be transformed into a diagonal matrix if we choose the right orthogonal coordinate systems for its domain and range. Using the orthogonality of V, we can write the decomposition in the form

    A V = U Σ.

6.1. The Minimum Norm Solution using SVD. If rank(A) < n, Theorem 2.4 tells us that the solution to the Least Squares Problem may not be unique, i.e., multiple vectors x give the minimum residual norm. Of course, rank deficiency should not happen in a well-formulated Least Squares Problem: the set of variables used to fit the data should be independent in the first place. In the case when A is rank-deficient, we can still compute a Least Squares Solution by selecting the solution x̂ that has the smallest norm among all the solutions.

We begin with the pseudoinverse A^+ from (3.3), redefined in terms of the SVD:

    A^+ = V Σ^+ U^T,

where Σ^+ = diag(σ_1^{-1}, ..., σ_n^{-1}). Note that this pseudoinverse exists regardless of whether the matrix is square or has full rank. To see this, suppose A = U Σ V^T is the SVD of A ∈ R^{m×n}. For rank-deficient matrices with rank(A) = r < n, we have

    Σ = [ Σ_1  0 ]
        [  0   0 ],    Σ_1 = diag(σ_1, ..., σ_r),    σ_1 ≥ ··· ≥ σ_r > 0,

with U and V partitioned accordingly as U = [U_1 U_2] and V = [V_1 V_2]. The pseudoinverse A^+ is given by

    A^+ = V Σ^+ U^T = V_1 Σ_1^{-1} U_1^T,    where Σ^+ = [ Σ_1^{-1}  0 ]
                                                         [    0     0 ].

Since Σ_1 is nonsingular, A^+ always exists, and it can easily be confirmed that A A^+ A = A and A^+ A A^+ = A^+. Now write c and y as

    c = U^T b = [ U_1^T b ] = [ c_1 ],    y = V^T x = [ V_1^T x ] = [ y_1 ]
                [ U_2^T b ]   [ c_2 ]                 [ V_2^T x ]   [ y_2 ].

Apply the orthogonality of U and the SVD of A to the Least Squares Problem:

    ||b - Ax||_2^2 = ||U U^T b - U Σ V^T x||_2^2 = ||c - Σ y||_2^2 = ||c_1 - Σ_1 y_1||_2^2 + ||c_2||_2^2,

so that we have ||b - Ax||_2 ≥ ||c_2||_2 for all x ∈ R^n. Equality holds when

    x = V y = [V_1 V_2] [ Σ_1^{-1} c_1 ] = A^+ b + V_2 y_2
                        [     y_2      ]

for any y_2 ∈ R^{n-r}, making x the general solution of the Least Squares Problem. Note that the solution is unique when r = rank(A) = n, in which case V_1 = V. On the other hand, if A is rank-deficient (r < n), we write z as

    z = V y = [V_1 V_2] [ Σ_1^{-1} c_1 ] = A^+ b + V_2 y_2
                        [     y_2      ]

for some nonzero y_2. Let x̂ = A^+ b, i.e., the solution when y_2 = 0. Then

    ||z||_2^2 = ||A^+ b||_2^2 + ||y_2||_2^2 > ||x̂||_2^2,

so that the vector x̂ = A^+ b = V_1 Σ_1^{-1} U_1^T b is the minimum norm solution to the Least Squares Problem. Thus the minimum norm solution can be obtained through the following algorithm (a code sketch is given below):

(1) Compute A = U Σ V^T = U_1 Σ_1 V_1^T, the SVD of A.
(2) Compute U^T b.
(3) Solve Σ w = U^T b for w (a diagonal system).
(4) Let x = V w.

The power of the SVD lies in the fact that it always exists and can be computed stably. The computed SVD will be well-conditioned because orthogonal matrices preserve the 2-norm. Any perturbation in A will not be amplified by the SVD, since ||δA||_2 = ||δΣ||_2.
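A minimal sketch of the four-step minimum norm algorithm above, in Python with NumPy (not part of the original paper). The tolerance used to decide which Singular Values are treated as nonzero is an assumption of the sketch.

    import numpy as np

    def min_norm_lstsq_svd(A, b, tol=1e-12):
        """Minimum norm least squares solution via the SVD (steps 1-4 above)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)   # step 1: reduced SVD
        c = U.T @ b                                        # step 2: U_1^T b
        r = int(np.sum(s > tol * s[0]))                    # numerical rank r
        w = np.zeros_like(s)
        w[:r] = c[:r] / s[:r]                              # step 3: solve the diagonal system
        return Vt.T @ w                                    # step 4: x = V_1 w

    # usage on an assumed rank-deficient 5 x 3 problem (third column = sum of the first two)
    rng = np.random.default_rng(3)
    B = rng.standard_normal((5, 2))
    A = np.column_stack([B, B[:, 0] + B[:, 1]])            # rank(A) = 2 < 3
    b = rng.standard_normal(5)
    x = min_norm_lstsq_svd(A, b)
    assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])   # lstsq also returns the minimum norm solution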

6.2. Computing the SVD of Matrix A. There are several ways to compute the SVD of the matrix A. First, the nonzero Singular Values of A can be computed by an eigenvalue computation for the normal matrix A^T A. However, as we have seen in the conditioning of the Normal Equations, this approach is numerically unstable. Alternatively, we can transform A to bidiagonal form using Householder reflections, and then transform this matrix into diagonal form using two sequences of orthogonal matrices. See [3] for an in-depth analysis. While there are other versions of this method, the work required for computing the SVD is typically evaluated to be about 2mn^2 + 11n^3 flops [3].

The computation of the SVD is an interesting subject in itself, and various alternative approaches are used in practice. Some methods emphasize speed (such as the divide-and-conquer methods), while others focus on the accuracy of small Singular Values (such as some QR implementations) [3]. In any case, once it has been computed, the SVD is a powerful tool in itself.

7. Comparison of Methods

So far, we have seen three algorithms for solving the Least Squares Problem:

(1) Normal Equations by Cholesky Factorization costs mn^2 + n^3/3 flops. It is the fastest method, but it is at the same time numerically unstable.
