Introduction to Convex Optimization for Machine Learning

Transcription

Introduction to Convex Optimization for Machine Learning
John Duchi, University of California, Berkeley
Practical Machine Learning, Fall 2009

Outline
- What is Optimization
- Convex Sets
- Convex Functions
- Convex Optimization Problems
- Lagrange Duality
- Optimization Algorithms
- Take Home Messages

What is Optimization (and why do we care?)

What is Optimization?

Finding the minimizer of a function subject to constraints:

$$\begin{aligned}
\text{minimize}_x \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \le 0, \quad i \in \{1, \dots, k\} \\
& h_j(x) = 0, \quad j \in \{1, \dots, l\}
\end{aligned}$$

Example: the stock market. "Minimize variance of return subject to getting at least 50."

Why do we care?

Optimization is at the heart of many (most practical?) machine learning algorithms.

- Linear regression: $\text{minimize}_w \; \|Xw - y\|^2$
- Classification (logistic regression or SVM):
  $$\text{minimize}_w \; \sum_{i=1}^n \log\big(1 + \exp(-y_i x_i^T w)\big)$$
  or
  $$\text{minimize}_w \; \|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \xi_i \ge 1 - y_i x_i^T w, \;\; \xi_i \ge 0.$$

We still care.

- Maximum likelihood estimation: $\text{maximize}_\theta \; \sum_{i=1}^n \log p_\theta(x_i)$
- k-means: $\text{minimize}_{\mu_1, \dots, \mu_k} \; J(\mu) = \sum_{j=1}^k \sum_{i \in C_j} \|x_i - \mu_j\|^2$
- Collaborative filtering: $\text{minimize}_w \; \sum_{i < j} \log\big(1 + \exp(w^T x_i - w^T x_j)\big)$
- And more (graphical models, feature selection, active learning, control)

But generally speaking...

We're screwed.
- Local (non-global) minima of $f_0$
- All kinds of constraints (even restricting to continuous functions): $h(x) = \sin(2\pi x) = 0$

[Figure: surface plot of a highly non-convex function.]

Go for convex problems!

Convex Sets

Definition. A set $C \subseteq \mathbb{R}^n$ is convex if for $x, y \in C$ and any $\alpha \in [0, 1]$,
$$\alpha x + (1 - \alpha) y \in C.$$

Examples

- All of $\mathbb{R}^n$ (obvious)
- Non-negative orthant, $\mathbb{R}^n_+$: let $x \succeq 0$, $y \succeq 0$; clearly $\alpha x + (1 - \alpha) y \succeq 0$.
- Norm balls: let $\|x\| \le 1$, $\|y\| \le 1$; then
  $$\|\alpha x + (1 - \alpha) y\| \le \|\alpha x\| + \|(1 - \alpha) y\| = \alpha \|x\| + (1 - \alpha) \|y\| \le 1.$$

Examples

Affine subspaces $\{x : Ax = b\}$: if $Ax = b$ and $Ay = b$, then
$$A(\alpha x + (1 - \alpha) y) = \alpha Ax + (1 - \alpha) Ay = \alpha b + (1 - \alpha) b = b.$$

[Figure: a plane in $\mathbb{R}^3$.]

More examples

Arbitrary intersections of convex sets: let $C_i$ be convex for $i \in I$ and $C = \bigcap_{i \in I} C_i$. Then
$$x \in C, \; y \in C \;\Rightarrow\; \alpha x + (1 - \alpha) y \in C_i \;\; \forall i \in I,$$
so $\alpha x + (1 - \alpha) y \in C$.

More examples

PSD matrices, a.k.a. the positive semidefinite cone $S^n_+ \subset \mathbb{R}^{n \times n}$. $A \in S^n_+$ means $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$. For $A, B \in S^n_+$,
$$x^T(\alpha A + (1 - \alpha) B)x = \alpha x^T A x + (1 - \alpha) x^T B x \ge 0.$$

[Figure: the boundary of $S^2_+$, the set $\{(x, y, z) : x \ge 0, \; y \ge 0, \; xy \ge z^2\}$.]

Convex Functions

Definition. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for $x, y \in \operatorname{dom} f$ and any $\alpha \in [0, 1]$,
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y).$$

[Figure: the chord $\alpha f(x) + (1 - \alpha) f(y)$ between $(x, f(x))$ and $(y, f(y))$ lies above the graph of $f$.]

First-order convexity conditions

Theorem. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is differentiable. Then $f$ is convex if and only if for all $x, y \in \operatorname{dom} f$,
$$f(y) \ge f(x) + \nabla f(x)^T (y - x).$$

[Figure: the tangent $f(x) + \nabla f(x)^T(y - x)$ at $(x, f(x))$ lies below the graph of $f$.]

Actually, more general than that

Definition. The subgradient set, or subdifferential set, $\partial f(x)$ of $f$ at $x$ is
$$\partial f(x) = \big\{g : f(y) \ge f(x) + g^T(y - x) \text{ for all } y\big\}.$$

Theorem. $f : \mathbb{R}^n \to \mathbb{R}$ is convex if and only if it has a non-empty subdifferential set everywhere.

[Figure: a supporting line $f(x) + g^T(y - x)$ through a kink of $f$ at $(x, f(x))$.]

Second-order convexity conditions

Theorem. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is twice differentiable. Then $f$ is convex if and only if for all $x \in \operatorname{dom} f$, $\nabla^2 f(x) \succeq 0$.

[Figure: a convex quadratic bowl.]

Convex sets and convex functions

Definition. The epigraph of a function $f$ is the set of points
$$\operatorname{epi} f = \{(x, t) : f(x) \le t\}.$$

- $\operatorname{epi} f$ is convex if and only if $f$ is convex.
- Sublevel sets, $\{x : f(x) \le a\}$, are convex for convex $f$.

Examples

- Linear/affine functions: $f(x) = b^T x + c$.
- Quadratic functions:
  $$f(x) = \frac{1}{2} x^T A x + b^T x + c$$
  for $A \succeq 0$. For regression:
  $$\frac{1}{2}\|Xw - y\|^2 = \frac{1}{2} w^T X^T X w - y^T X w + \frac{1}{2} y^T y.$$

More examples

- Norms (like $\ell_1$ or $\ell_2$ for regularization):
  $$\|\alpha x + (1 - \alpha) y\| \le \|\alpha x\| + \|(1 - \alpha) y\| = \alpha \|x\| + (1 - \alpha) \|y\|.$$
- Composition with an affine function, $f(Ax + b)$:
  $$f(A(\alpha x + (1 - \alpha) y) + b) = f(\alpha(Ax + b) + (1 - \alpha)(Ay + b)) \le \alpha f(Ax + b) + (1 - \alpha) f(Ay + b)$$
- Log-sum-exp (via $\nabla^2 f(x)$ PSD):
  $$f(x) = \log\left(\sum_{i=1}^n \exp(x_i)\right)$$

Important examples in Machine Learning

- SVM loss: $f(w) = \left[1 - y_i x_i^T w\right]_+$
- Binary logistic loss: $f(w) = \log\big(1 + \exp(-y_i x_i^T w)\big)$

[Figure: the hinge loss $[1 - x]_+$ and the logistic loss $\log(1 + e^{-x})$.]

Convex Optimization Problems

Definition. An optimization problem is convex if its objective $f_0$ is a convex function, the inequality constraints $f_i$ are convex, and the equality constraints $h_j$ are affine:

$$\begin{aligned}
\text{minimize}_x \quad & f_0(x) && \text{(convex function)} \\
\text{s.t.} \quad & f_i(x) \le 0 && \text{(convex sets)} \\
& h_j(x) = 0 && \text{(affine)}
\end{aligned}$$

It's nice to be convex

Theorem. If $\hat{x}$ is a local minimizer of a convex optimization problem, it is a global minimizer.

[Figure: a convex function has a single basin.]

Even more reasons to be convex

Theorem. $\nabla f(x) = 0$ if and only if $x$ is a global minimizer of $f$.

Proof sketch.
- If $\nabla f(x) = 0$: for any $y$, $f(y) \ge f(x) + \nabla f(x)^T (y - x) = f(x)$.
- If $\nabla f(x) \ne 0$: there is a direction of descent, so $x$ is not a minimizer.

LET'S TAKE A BREAK

Lagrange Duality

Goals of Lagrange Duality

- Get a certificate of optimality for a problem
- Remove constraints
- Reformulate the problem

Constructing the dual

Start with the optimization problem:

$$\begin{aligned}
\text{minimize}_x \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \le 0, \quad i \in \{1, \dots, k\} \\
& h_j(x) = 0, \quad j \in \{1, \dots, l\}
\end{aligned}$$

Form the Lagrangian using Lagrange multipliers $\lambda_i \ge 0$, $\nu_j \in \mathbb{R}$:
$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^k \lambda_i f_i(x) + \sum_{j=1}^l \nu_j h_j(x)$$

Form the dual function:
$$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \inf_x \left[ f_0(x) + \sum_{i=1}^k \lambda_i f_i(x) + \sum_{j=1}^l \nu_j h_j(x) \right]$$

Remarks

The original problem is equivalent to
$$\text{minimize}_x \left[ \sup_{\lambda \succeq 0, \nu} L(x, \lambda, \nu) \right].$$

The dual problem switches the min and the max:
$$\text{maximize}_{\lambda \succeq 0, \nu} \left[ \inf_x L(x, \lambda, \nu) \right].$$

One Great Property of the Dual

Lemma (Weak Duality). If $\lambda \succeq 0$, then
$$g(\lambda, \nu) \le f_0(x^*).$$

Proof. We have
$$g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) \le L(x^*, \lambda, \nu) = f_0(x^*) + \sum_{i=1}^k \lambda_i f_i(x^*) + \sum_{j=1}^l \nu_j h_j(x^*) \le f_0(x^*).$$

The Greatest Property of the Dual

Theorem. For reasonable¹ convex problems,
$$\sup_{\lambda \succeq 0, \nu} g(\lambda, \nu) = f_0(x^*).$$

¹ There are conditions, called constraint qualifications, for which this is true.

Geometric Look

Minimize $\frac{1}{2}(x + 1)^2$ subject to $x^2 \le c$.

[Figures: left, the true function (blue), the constraint (green), and $L(x, \lambda)$ for different $\lambda$ (dotted); right, the dual function $g(\lambda)$ (black) and the primal optimal value (dotted blue).]

Intuition

Can interpret duality as linear approximation.

Define $I_-(a) = \infty$ if $a > 0$ and $0$ otherwise, and $I_0(a) = \infty$ unless $a = 0$. Rewrite the problem as
$$\text{minimize}_x \;\; f_0(x) + \sum_{i=1}^k I_-(f_i(x)) + \sum_{j=1}^l I_0(h_j(x))$$

Replace $I_-(f_i(x))$ with $\lambda_i f_i(x)$: a measure of "displeasure" when $\lambda_i \ge 0$ and $f_i(x) > 0$. Likewise $\nu_j h_j(x)$ lower bounds $I_0(h_j(x))$:
$$\text{minimize}_x \;\; f_0(x) + \sum_{i=1}^k \lambda_i f_i(x) + \sum_{j=1}^l \nu_j h_j(x)$$

Example: Linearly constrained least squares

$$\text{minimize}_x \;\; \frac{1}{2}\|Ax - b\|^2 \quad \text{s.t.} \quad Bx = d.$$

Form the Lagrangian:
$$L(x, \nu) = \frac{1}{2}\|Ax - b\|^2 + \nu^T(Bx - d)$$

Take the infimum:
$$\nabla_x L(x, \nu) = A^T A x - A^T b + B^T \nu = 0 \;\Rightarrow\; x = (A^T A)^{-1}(A^T b - B^T \nu)$$

$$\inf_x L(x, \nu) = \frac{1}{2}\left\|A(A^T A)^{-1}(A^T b - B^T \nu) - b\right\|^2 + \nu^T B (A^T A)^{-1}(A^T b - B^T \nu) - d^T \nu$$

A simple unconstrained quadratic problem!
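A minimal numpy sketch of this example (the random A, b, B, d are stand-ins, not from the slides): for fixed $\nu$ the Lagrangian minimizer is $x(\nu) = (A^TA)^{-1}(A^Tb - B^T\nu)$, and since $\nabla_\nu g(\nu) = Bx(\nu) - d$, maximizing the concave dual amounts to solving $Bx(\nu) = d$ for $\nu$:

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
B, d = rng.normal(size=(2, 5)), rng.normal(size=2)

# For fixed nu, the Lagrangian minimizer is x(nu) = (A^T A)^{-1} (A^T b - B^T nu).
AtA, Atb = A.T @ A, A.T @ b
M = np.linalg.solve(AtA, B.T)            # (A^T A)^{-1} B^T
# Maximizing g(nu) is equivalent to enforcing B x(nu) = d; solve the 2x2 system for nu.
nu = np.linalg.solve(B @ M, B @ np.linalg.solve(AtA, Atb) - d)
x = np.linalg.solve(AtA, Atb - B.T @ nu)

print(np.allclose(B @ x, d))             # True: the constraint holds at the optimum
```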

Example: Quadratically constrained least squares

$$\text{minimize}_x \;\; \frac{1}{2}\|Ax - b\|^2 \quad \text{s.t.} \quad \frac{1}{2}\|x\|^2 \le c.$$

Form the Lagrangian ($\lambda \ge 0$):
$$L(x, \lambda) = \frac{1}{2}\|Ax - b\|^2 + \lambda\left(\frac{1}{2}\|x\|^2 - c\right)$$

Take the infimum:
$$\nabla_x L(x, \lambda) = A^T A x - A^T b + \lambda x = 0 \;\Rightarrow\; x = (A^T A + \lambda I)^{-1} A^T b$$

$$\inf_x L(x, \lambda) = \frac{1}{2}\left\|A(A^T A + \lambda I)^{-1} A^T b - b\right\|^2 + \frac{\lambda}{2}\left\|(A^T A + \lambda I)^{-1} A^T b\right\|^2 - \lambda c$$

A one-variable dual problem!
$$g(\lambda) = -\frac{1}{2} b^T A (A^T A + \lambda I)^{-1} A^T b - \lambda c + \frac{1}{2}\|b\|^2.$$
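Because $g(\lambda)$ is a concave function of a single scalar, the dual can be maximized numerically. A sketch using scipy's bounded scalar minimizer on $-g(\lambda)$; the random data, the radius $c$, and the bracket $(0, 10^3)$ are arbitrary stand-ins (with $c$ small so the constraint binds):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
A, b, c = rng.normal(size=(20, 5)), rng.normal(size=20), 0.01
AtA, Atb = A.T @ A, A.T @ b

def g(lam):
    # g(lambda) = -1/2 b^T A (A^T A + lambda I)^{-1} A^T b - lambda c + 1/2 ||b||^2
    v = np.linalg.solve(AtA + lam * np.eye(5), Atb)
    return -0.5 * Atb @ v - lam * c + 0.5 * b @ b

res = minimize_scalar(lambda lam: -g(lam), bounds=(0.0, 1e3), method="bounded")
lam = res.x
x = np.linalg.solve(AtA + lam * np.eye(5), Atb)     # primal point recovered from lambda
print(g(lam), 0.5 * np.sum((A @ x - b) ** 2))       # dual and primal values (nearly) coincide
```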

Uses of the Dual

Main use: a certificate of optimality (a.k.a. the duality gap). If we have a feasible $x$ and know the dual value $g(\lambda, \nu)$, then
$$g(\lambda, \nu) \le f_0(x^*) \le f_0(x),$$
so
$$f_0(x^*) \ge f_0(x) - \left[f_0(x) - g(\lambda, \nu)\right] \quad\text{and}\quad f_0(x) - f_0(x^*) \le f_0(x) - g(\lambda, \nu).$$

Also used in more advanced primal-dual algorithms (we won't talk about these).

Optimization Algorithms

Gradient Descent

The simplest algorithm in the world (almost). Goal:
$$\text{minimize}_x \; f(x)$$

Just iterate
$$x_{t+1} = x_t - \eta_t \nabla f(x_t)$$
where $\eta_t$ is the stepsize.
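A minimal numpy sketch of the iteration with a fixed stepsize, applied to the least-squares objective from the introduction (the stepsize 0.01, the iteration count, and the random data are illustrative choices):

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.01, iters=500):
    """Iterate x_{t+1} = x_t - eta * grad(x_t) with a fixed stepsize eta."""
    x = x0
    for _ in range(iters):
        x = x - eta * grad(x)
    return x

# Example: minimize f(w) = 1/2 ||Xw - y||^2, whose gradient is X^T (Xw - y).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
w = gradient_descent(lambda w: X.T @ (X @ w - y), np.zeros(3))
print(np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0], atol=1e-4))  # True
```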

Single Step Illustration

[Figure: the graph of $f$, the point $(x_t, f(x_t))$, and the linear approximation $f(x_t) + \nabla f(x_t)^T(x - x_t)$ along which the step $-\eta \nabla f(x_t)$ is taken.]

Full Gradient Descent

$$f(x) = \log\big(\exp(x_1 + 3x_2 - 0.1) + \exp(x_1 - 3x_2 - 0.1) + \exp(-x_1 - 0.1)\big)$$

[Figure: contour plot of $f$ with the gradient descent iterates.]

Stepsize Selection

How do I choose a stepsize?

- Idea 1: exact line search
  $$\eta_t = \operatorname*{argmin}_\eta f(x - \eta \nabla f(x))$$
  Too expensive to be practical.
- Idea 2: backtracking (Armijo) line search. Let $\alpha \in (0, \frac{1}{2})$, $\beta \in (0, 1)$. Multiply $\eta \leftarrow \beta\eta$ until
  $$f(x - \eta \nabla f(x)) \le f(x) - \alpha \eta \|\nabla f(x)\|^2$$
  Works well in practice (see the sketch below).
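A sketch of one backtracking step ($\alpha = 0.3$ and $\beta = 0.5$ are typical choices within the stated ranges); the log-sum-exp test function echoes the earlier figure, with the constant offsets dropped:

```python
import numpy as np

def backtracking_step(f, grad, x, alpha=0.3, beta=0.5, eta0=1.0):
    """Shrink eta until the Armijo condition
    f(x - eta*g) <= f(x) - alpha*eta*||g||^2 holds, then take the step."""
    g = grad(x)
    eta = eta0
    while f(x - eta * g) > f(x) - alpha * eta * g @ g:
        eta *= beta
    return x - eta * g

# Example on f(x) = log(sum_i exp(a_i^T x)), a smooth convex function.
A = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
f = lambda x: np.log(np.sum(np.exp(A @ x)))
grad = lambda x: A.T @ (np.exp(A @ x) / np.sum(np.exp(A @ x)))  # A^T softmax(Ax)
x = np.array([1.0, 1.0])
for _ in range(50):
    x = backtracking_step(f, grad, x)
print(f(x))
```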

Illustration of Armijo/Backtracking Line Search

[Figure: $f(x - \eta \nabla f(x))$ and the line $f(x) - \alpha \eta \|\nabla f(x)\|^2$ as functions of the stepsize $\eta$. Clearly there is a region where $f(x - \eta \nabla f(x))$ lies below the line.]

Newton's method

Idea: use a second-order approximation to the function:
$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x$$

Choose $\Delta x$ to minimize the approximation:
$$\Delta x = -\left[\nabla^2 f(x)\right]^{-1} \nabla f(x)$$

This is a descent direction:
$$\nabla f(x)^T \Delta x = -\nabla f(x)^T \left[\nabla^2 f(x)\right]^{-1} \nabla f(x) < 0.$$
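A sketch of the iteration on the same log-sum-exp function. The backtracking guard here is an addition to the pure step from the slide, which can overshoot far from the optimum:

```python
import numpy as np

A = np.array([[1.0, 3.0], [1.0, -3.0], [-1.0, 0.0]])
f = lambda x: np.log(np.sum(np.exp(A @ x)))          # log-sum-exp again

def newton_step(x):
    """One damped Newton step: x <- x + eta * dx with dx = -H^{-1} g."""
    p = np.exp(A @ x)
    p /= p.sum()                                      # softmax weights
    g = A.T @ p                                       # gradient of log-sum-exp
    H = A.T @ (np.diag(p) - np.outer(p, p)) @ A       # Hessian of log-sum-exp
    dx = -np.linalg.solve(H, g)                       # Newton direction
    eta = 1.0
    while f(x + eta * dx) > f(x) + 0.3 * eta * g @ dx:  # backtrack to guard the step
        eta *= 0.5
    return x + eta * dx

x = np.array([1.0, 1.0])
for _ in range(10):
    x = newton_step(x)
print(x, f(x))                                        # converges in a handful of steps
```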

Newton step picture

[Figure: $\hat{f}$ is the 2nd-order approximation, $f$ is the true function; the step moves from $(x, f(x))$ to $(x + \Delta x, f(x + \Delta x))$.]

Convergence of gradient descent and Newton's method

- Strongly convex case ($\nabla^2 f(x) \succeq mI$): "linear convergence." For some $\gamma \in (0, 1)$,
  $$f(x_t) - f(x^*) \le \gamma^t,$$
  so $t \ge \frac{\log(1/\varepsilon)}{\log(1/\gamma)}$ iterations give $f(x_t) - f(x^*) \le \varepsilon$.
- Smooth case ($\|\nabla f(x) - \nabla f(y)\| \le C\|x - y\|$):
  $$f(x_t) - f(x^*) \le \frac{K}{t}$$
- Newton's method is often faster, especially when $f$ has "long valleys."

What about constraints?

- Linear constraints $Ax = b$ are easy. For example, in Newton's method (assume $Ax = b$):
  $$\text{minimize}_{\Delta x} \;\; \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x \quad \text{s.t.} \quad A \Delta x = 0.$$
  The solution $\Delta x$ satisfies $A(x + \Delta x) = Ax + A\Delta x = b$.
- Inequality constraints are a bit tougher:
  $$\text{minimize}_{\Delta x} \;\; \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x \quad \text{s.t.} \quad f_i(x + \Delta x) \le 0$$
  is just as hard as the original problem.

Logarithmic Barrier Methods

Goal:
$$\text{minimize}_x \;\; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i \in \{1, \dots, k\}$$

Convert to
$$\text{minimize}_x \;\; f_0(x) + \sum_{i=1}^k I_-(f_i(x))$$

Approximate $I_-(u) \approx -t \log(-u)$ for small $t$:
$$\text{minimize}_x \;\; f_0(x) - t \sum_{i=1}^k \log(-f_i(x))$$
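A sketch of the scheme on a toy one-dimensional problem, $f_0(x) = x^2$ subject to $x \ge 1$, i.e. $f_1(x) = 1 - x \le 0$. The inner minimizer here is a damped Newton step; the damping only enforces strict feasibility, and a fuller implementation would also backtrack on the objective value:

```python
# Toy problem: minimize f0(x) = x^2 subject to x >= 1 (f1(x) = 1 - x <= 0).
# Barrier objective: x^2 - t*log(x - 1); its minimizer approaches x = 1 as t -> 0.
x = 2.0                                     # strictly feasible start
for t in [1.0, 0.1, 0.01, 0.001]:           # shrink the barrier weight
    for _ in range(25):                     # inner loop: damped 1-D Newton steps
        g = 2.0 * x - t / (x - 1.0)         # derivative of the barrier objective
        h = 2.0 + t / (x - 1.0) ** 2        # second derivative (always > 0)
        step = g / h
        while x - step <= 1.0:              # damp the step to stay strictly feasible
            step *= 0.5
        x -= step
    print(t, x)                             # x -> 1 as t -> 0
```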

The barrier function

[Figure: $I_-(u)$ is the dotted line; the others are $-t \log(-u)$ for various $t$.]

Illustration

Minimizing $c^T x$ subject to $Ax \preceq b$.

[Figure: barrier solutions inside the polytope for $t = 1$ and $t = 5$.]

Subgradient Descent

Really, the simplest algorithm in the world. Goal:
$$\text{minimize}_x \; f(x)$$

Just iterate
$$x_{t+1} = x_t - \eta_t g_t$$
where $\eta_t$ is a stepsize and $g_t \in \partial f(x_t)$.
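A sketch on a non-differentiable objective, $f(x) = \|Ax - b\|_1$, with subgradient $A^T \operatorname{sign}(Ax - b)$ (random data as stand-ins); the decreasing stepsize and best-iterate tracking anticipate the analysis below:

```python
import numpy as np

def subgradient_descent(subgrad, f, x0, eta0=0.1, iters=1000):
    """x_{t+1} = x_t - (eta0/sqrt(t)) * g_t with g_t in the subdifferential;
    track the best iterate, since the objective need not decrease monotonically."""
    x, best_x, best_f = x0, x0, f(x0)
    for t in range(1, iters + 1):
        x = x - eta0 / np.sqrt(t) * subgrad(x)
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x

# Example: f(x) = ||Ax - b||_1, subgradient A^T sign(Ax - b).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 5)), rng.normal(size=30)
x = subgradient_descent(lambda x: A.T @ np.sign(A @ x - b),
                        lambda x: np.abs(A @ x - b).sum(),
                        np.zeros(5))
```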

Why subgradient descent?

- Lots of non-differentiable convex functions are used in machine learning:
  $$f(x) = \left[1 - a^T x\right]_+, \qquad f(x) = \|x\|_1, \qquad f(X) = \sum_{r=1}^k \sigma_r(X)$$
  where $\sigma_r$ is the $r$th singular value of $X$.
- Easy to analyze
- Do not even need a true subgradient: just need $\mathbb{E}[g_t] \in \partial f(x_t)$.

Proof of convergence for subgradient descent

Idea: bound $\|x_{t+1} - x^*\|$ using the subgradient inequality. Assume $\|g_t\| \le G$.
$$\|x_{t+1} - x^*\|^2 = \|x_t - \eta g_t - x^*\|^2 = \|x_t - x^*\|^2 - 2\eta g_t^T(x_t - x^*) + \eta^2 \|g_t\|^2$$

Recall that
$$f(x^*) \ge f(x_t) + g_t^T(x^* - x_t), \quad \text{so} \quad -g_t^T(x_t - x^*) \le f(x^*) - f(x_t),$$
giving
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2 + 2\eta\left[f(x^*) - f(x_t)\right] + \eta^2 G^2.$$

Then
$$f(x_t) - f(x^*) \le \frac{\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2}{2\eta} + \frac{\eta G^2}{2}.$$

Almost done.

Sum from $t = 1$ to $T$:
$$\sum_{t=1}^T \left[f(x_t) - f(x^*)\right] \le \frac{1}{2\eta} \sum_{t=1}^T \left[\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right] + \frac{T\eta G^2}{2} = \frac{1}{2\eta}\|x_1 - x^*\|^2 - \frac{1}{2\eta}\|x_{T+1} - x^*\|^2 + \frac{T\eta G^2}{2}$$

Now let $D = \|x_1 - x^*\|$, and keep track of the minimum along the run:
$$f(x_{\text{best}}) - f(x^*) \le \frac{D^2}{2\eta T} + \frac{\eta G^2}{2}.$$

Set $\eta = \frac{D}{G\sqrt{T}}$, and
$$f(x_{\text{best}}) - f(x^*) \le \frac{DG}{\sqrt{T}}.$$

Extension: projected subgradient descent

Now we have a convex constraint set $X$. Goal:
$$\text{minimize}_x \; f(x) \quad \text{s.t.} \quad x \in X$$

Idea: do subgradient steps, projecting $x_t$ back into $X$ at every iteration:
$$x_{t+1} = \Pi_X(x_t - \eta g_t)$$

The proof goes through because
$$\|\Pi_X(x_t) - x^*\| \le \|x_t - x^*\| \quad \text{if } x^* \in X.$$

[Figure: a step leaving $X$ is projected back to $\Pi_X(x_t)$.]

Projected subgradient example

$$\text{minimize}_x \;\; \frac{1}{2}\|Ax - b\|^2 \quad \text{s.t.} \quad \|x\|_1 \le 1$$

[Figures: a sequence of frames showing the projected subgradient iterates converging within the $\ell_1$ ball.]
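A sketch of this example. The $\ell_1$-ball projection below is the standard sort-and-threshold routine (not spelled out on the slide); the data and stepsizes are arbitrary stand-ins:

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection onto {x : ||x||_1 <= radius} via sorting."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]                    # sorted absolute values, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > cssv - radius)[0][-1]
    theta = (cssv[rho] - radius) / (rho + 1.0)      # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 5)), rng.normal(size=30)
x = np.zeros(5)
for t in range(1, 501):
    g = A.T @ (A @ x - b)                           # gradient of 1/2 ||Ax - b||^2
    x = project_l1_ball(x - 0.01 / np.sqrt(t) * g)  # step, then project back into X
print(np.abs(x).sum())                              # stays <= 1
```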

Convergence results for (projected) subgradient methods

- Any decreasing, non-summable stepsize ($\eta_t \to 0$, $\sum_{t=1}^\infty \eta_t = \infty$) gives
  $$f(x_{\text{avg}(t)}) - f(x^*) \to 0.$$
- A slightly less brain-dead analysis than earlier shows that with $\eta_t \propto 1/\sqrt{t}$,
  $$f(x_{\text{avg}(t)}) - f(x^*) \le \frac{C}{\sqrt{t}}.$$
- The same convergence holds when $g_t$ is random, i.e. $\mathbb{E}[g_t] \in \partial f(x_t)$. Example:
  $$f(w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \left[1 - y_i x_i^T w\right]_+$$
  Just pick a random training example (see the sketch below).
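A sketch of that stochastic variant on the SVM objective above (random data; $C$ and the iteration count are arbitrary choices). Scaling the single-example hinge term by $n$ keeps the expected step inside the subdifferential of the full objective:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, C = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))  # noisy linear labels

w = np.zeros(d)
for t in range(1, 5001):
    i = rng.integers(n)                      # one random training example
    g = w.copy()                             # subgradient of the 1/2 ||w||^2 term
    if 1.0 - y[i] * X[i] @ w > 0.0:          # hinge term active for example i?
        g -= C * n * y[i] * X[i]             # n * single-example term: E[g] is in the subdifferential
    w -= 1.0 / np.sqrt(t) * g                # decreasing, non-summable stepsize
print(np.mean(np.sign(X @ w) == y))          # training accuracy of the learned w
```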

Recap

- Defined convex sets and functions
- Saw why we want optimization problems to be convex (solvable)
- Sketched some of Lagrange duality
- First-order methods are easy and (often) work well

Take Home Messages

- Many useful problems can be formulated as convex optimization problems
- If it is not convex and not an eigenvalue problem, you are out of luck
- If it is convex, you are golden
