
Optimization for Machine Learning

Neural Information Processing Series
Michael I. Jordan and Thomas Dietterich, editors

Advances in Large Margin Classifiers, Alexander J. Smola, Peter L. Bartlett, Bernhard Schölkopf, and Dale Schuurmans, eds., 2000
Advanced Mean Field Methods: Theory and Practice, Manfred Opper and David Saad, eds., 2001
Probabilistic Models of the Brain: Perception and Neural Function, Rajesh P. N. Rao, Bruno A. Olshausen, and Michael S. Lewicki, eds., 2002
Exploratory Analysis and Data Modeling in Functional Neuroimaging, Friedrich T. Sommer and Andrzej Wichert, eds., 2003
Advances in Minimum Description Length: Theory and Applications, Peter D. Grünwald, In Jae Myung, and Mark A. Pitt, eds., 2005
Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, Gregory Shakhnarovich, Piotr Indyk, and Trevor Darrell, eds., 2006
New Directions in Statistical Signal Processing: From Systems to Brains, Simon Haykin, José C. Príncipe, Terrence J. Sejnowski, and John McWhirter, eds., 2007
Predicting Structured Data, Gökhan BakIr, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S. V. N. Vishwanathan, eds., 2007
Toward Brain-Computer Interfacing, Guido Dornhege, José del R. Millán, Thilo Hinterberger, Dennis J. McFarland, and Klaus-Robert Müller, eds., 2007
Large-Scale Kernel Machines, Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, eds., 2007
Learning Machine Translation, Cyril Goutte, Nicola Cancedda, Marc Dymetman, and George Foster, eds., 2009
Dataset Shift in Machine Learning, Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, eds., 2009
Optimization for Machine Learning, Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, eds., 2012

Optimization for Machine Learning

Edited by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright

The MIT Press
Cambridge, Massachusetts
London, England

© 2012 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

For information about special quantity discounts, please email special_sales@mitpress.mit.edu

This book was set in LaTeX by the authors and editors. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Optimization for machine learning / edited by Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright.
p. cm. — (Neural information processing series)
Includes bibliographical references.
ISBN 978-0-262-01646-9 (hardcover : alk. paper)
1. Machine learning—Mathematical models. 2. Mathematical optimization. I. Sra, Suvrit, 1976– II. Nowozin, Sebastian, 1980– III. Wright, Stephen J., 1960–
Q325.5.O65 2012
006.3'1—dc22
2011002059

10 9 8 7 6 5 4 3 2 1

Contents

Series Foreword
Preface

1 Introduction: Optimization and Machine Learning
  S. Sra, S. Nowozin, and S. J. Wright
  1.1 Support Vector Machines
  1.2 Regularized Optimization
  1.3 Summary of the Chapters
  1.4 References

2 Convex Optimization with Sparsity-Inducing Norms
  F. Bach, R. Jenatton, J. Mairal, and G. Obozinski
  2.1 Introduction
  2.2 Generic Methods
  2.3 Proximal Methods
  2.4 (Block) Coordinate Descent Algorithms
  2.5 Reweighted-ℓ2 Algorithms
  2.6 Working-Set Methods
  2.7 Quantitative Evaluation
  2.8 Extensions
  2.9 Conclusion
  2.10 References

3 Interior-Point Methods for Large-Scale Cone Programming
  M. Andersen, J. Dahl, Z. Liu, and L. Vandenberghe
  3.1 Introduction
  3.2 Primal-Dual Interior-Point Methods
  3.3 Linear and Quadratic Programming
  3.4 Second-Order Cone Programming
  3.5 Semidefinite Programming
  3.6 Conclusion
  3.7 References

4 Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey
  D. P. Bertsekas
  4.1 Introduction
  4.2 Incremental Subgradient-Proximal Methods
  4.3 Convergence for Methods with Cyclic Order
  4.4 Convergence for Methods with Randomized Order
  4.5 Some Applications
  4.6 Conclusions
  4.7 References

5 First-Order Methods for Nonsmooth Convex Large-Scale Optimization, I: General Purpose Methods
  A. Juditsky and A. Nemirovski
  5.1 Introduction
  5.2 Mirror Descent Algorithm: Minimizing over a Simple Set
  5.3 Problems with Functional Constraints
  5.4 Minimizing Strongly Convex Functions
  5.5 Mirror Descent Stochastic Approximation
  5.6 Mirror Descent for Convex-Concave Saddle-Point Problems
  5.7 Setting up a Mirror Descent Method
  5.8 Notes and Remarks
  5.9 References

6 First-Order Methods for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem's Structure
  A. Juditsky and A. Nemirovski
  6.1 Introduction
  6.2 Saddle-Point Reformulations of Convex Minimization Problems
  6.3 Mirror-Prox Algorithm
  6.4 Accelerating the Mirror-Prox Algorithm
  6.5 Accelerating First-Order Methods by Randomization
  6.6 Notes and Remarks
  6.7 References

7 Cutting-Plane Methods in Machine Learning
  V. Franc, S. Sonnenburg, and T. Werner
  7.1 Introduction to Cutting-plane Methods
  7.2 Regularized Risk Minimization
  7.3 Multiple Kernel Learning
  7.4 MAP Inference in Graphical Models
  7.5 References

8 Introduction to Dual Decomposition for Inference
  D. Sontag, A. Globerson, and T. Jaakkola
  8.1 Introduction
  8.2 Motivating Applications
  8.3 Dual Decomposition and Lagrangian Relaxation
  8.4 Subgradient Algorithms
  8.5 Block Coordinate Descent Algorithms
  8.6 Relations to Linear Programming Relaxations
  8.7 Decoding: Finding the MAP Assignment
  8.8 Discussion
  8.9 References

9 Augmented Lagrangian Methods for Learning, Selecting, and Combining Features
  R. Tomioka, T. Suzuki, and M. Sugiyama
  9.1 Introduction
  9.2 Background
  9.3 Proximal Minimization Algorithm
  9.4 Dual Augmented Lagrangian (DAL) Algorithm
  9.5 Connections
  9.6 Application
  9.7 Summary
  9.8 References

10 The Convex Optimization Approach to Regret Minimization
  E. Hazan
  10.1 Introduction
  10.2 The RFTL Algorithm and Its Analysis
  10.3 The "Primal-Dual" Approach
  10.4 Convexity of Loss Functions
  10.5 Recent Applications
  10.6 References

11 Projected Newton-type Methods in Machine Learning
  M. Schmidt, D. Kim, and S. Sra
  11.1 Introduction
  11.2 Projected Newton-type Methods
  11.3 Two-Metric Projection Methods
  11.4 Inexact Projection Methods
  11.5 Toward Nonsmooth Objectives
  11.6 Summary and Discussion
  11.7 References

12 Interior-Point Methods in Machine Learning
  J. Gondzio
  12.1 Introduction
  12.2 Interior-Point Methods: Background
  12.3 Polynomial Complexity Result
  12.4 Interior-Point Methods for Machine Learning
  12.5 Accelerating Interior-Point Methods
  12.6 Conclusions
  12.7 References

13 The Tradeoffs of Large-Scale Learning
  L. Bottou and O. Bousquet
  13.1 Introduction
  13.2 Approximate Optimization
  13.3 Asymptotic Analysis
  13.4 Experiments
  13.5 Conclusion
  13.6 References

14 Robust Optimization in Machine Learning
  C. Caramanis, S. Mannor, and H. Xu
  14.1 Introduction
  14.2 Background on Robust Optimization
  14.3 Robust Optimization and Adversary Resistant Learning
  14.4 Robust Optimization and Regularization
  14.5 Robustness and Consistency
  14.6 Robustness and Generalization
  14.7 Conclusion
  14.8 References

15 Improving First and Second-Order Methods by Modeling Uncertainty
  N. Le Roux, Y. Bengio, and A. Fitzgibbon
  15.1 Introduction
  15.2 Optimization Versus Learning
  15.3 Building a Model of the Gradients
  15.4 The Relative Roles of the Covariance and the Hessian
  15.5 A Second-Order Model of the Gradients
  15.6 An Efficient Implementation of Online Consensus Gradient: TONGA
  15.7 Experiments
  15.8 Conclusion
  15.9 References

16 Bandit View on Noisy Optimization
  J.-Y. Audibert, S. Bubeck, and R. Munos
  16.1 Introduction
  16.2 Concentration Inequalities
  16.3 Discrete Optimization
  16.4 Online Optimization
  16.5 References

17 Optimization Methods for Sparse Inverse Covariance Selection
  K. Scheinberg and S. Ma
  17.1 Introduction
  17.2 Block Coordinate Descent Methods
  17.3 Alternating Linearization Method
  17.4 Remarks on Numerical Performance
  17.5 References

18 A Pathwise Algorithm for Covariance Selection
  V. Krishnamurthy, S. D. Ahipaşaoğlu, and A. d'Aspremont
  18.1 Introduction
  18.2 Covariance Selection
  18.3 Algorithm
  18.4 Numerical Results
  18.5 Online Covariance Selection
  18.6 References

Series Foreword

The yearly Neural Information Processing Systems (NIPS) workshops bring together scientists with broadly varying backgrounds in statistics, mathematics, computer science, physics, electrical engineering, neuroscience, and cognitive science, unified by a common desire to develop novel computational and statistical strategies for information processing and to understand the mechanisms for information processing in the brain. In contrast to conferences, these workshops maintain a flexible format that both allows and encourages the presentation and discussion of work in progress. They thus serve as an incubator for the development of important new ideas in this rapidly evolving field. The series editors, in consultation with workshop organizers and members of the NIPS Foundation Board, select specific workshop topics on the basis of scientific excellence, intellectual breadth, and technical impact. Collections of papers chosen and edited by the organizers of specific workshops are built around pedagogical introductory chapters, while research monographs provide comprehensive descriptions of workshop-related topics, to create a series of books that provides a timely, authoritative account of the latest developments in the exciting field of neural computation.

Michael I. Jordan and Thomas G. Dietterich

Preface

The intersection of interests between machine learning and optimization has engaged many leading researchers in both communities for some years now. Both are vital and growing fields, and the areas of shared interest are expanding too. This volume collects contributions from many researchers who have been a part of these efforts.

We are grateful first to the contributors to this volume. Their cooperation in providing high-quality material while meeting tight deadlines is highly appreciated. We further thank the many participants in the two workshops on Optimization and Machine Learning, held at the NIPS Workshops in 2008 and 2009. The interest generated by these events was a key motivator for this volume. Special thanks go to S. V. N. Vishwanathan (Vishy) for organizing these workshops with us, and to PASCAL2, MOSEK, and Microsoft Research for their generous financial support for the workshops.

S. S. thanks his father for his constant interest, encouragement, and advice towards this book. S. N. thanks his wife and family. S. W. thanks all those colleagues who introduced him to machine learning, especially Partha Niyogi, to whose memory his efforts on this book are dedicated.

Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright

1 Introduction: Optimization and Machine Learning

Suvrit Sra                              suvrit.sra@tuebingen.mpg.de
Max Planck Institute for Biological Cybernetics
Tübingen, Germany

Sebastian Nowozin                       Sebastian.Nowozin@Microsoft.com
Microsoft Research
Cambridge, United Kingdom

Stephen J. Wright                       swright@cs.wisc.edu
University of Wisconsin
Madison, Wisconsin, USA

Since its earliest days as a discipline, machine learning has made use of optimization formulations and algorithms. Likewise, machine learning has contributed to optimization, driving the development of new optimization approaches that address the significant challenges presented by machine learning applications. This cross-fertilization continues to deepen, producing a growing literature at the intersection of the two fields while attracting leading researchers to the effort.

Optimization approaches have enjoyed prominence in machine learning because of their wide applicability and attractive theoretical properties. While techniques proposed twenty years and more ago continue to be refined, the increased complexity, size, and variety of today's machine learning models demand a principled reassessment of existing assumptions and techniques. This book makes a start toward such a reassessment. Besides describing the resurgence in novel contexts of established frameworks such as first-order methods, stochastic approximations, convex relaxations, interior-point methods, and proximal methods, the book devotes significant attention to newer themes such as regularized optimization, robust optimization, a variety of gradient and subgradient methods, and the use of splitting techniques and second-order information. We aim to provide an up-to-date account of the optimization techniques useful to machine learning — those that are established and prevalent, as well as those that are rising in importance.

To illustrate our aim more concretely, we review in Sections 1.1 and 1.2 two major paradigms that provide focus to research at the confluence of machine learning and optimization: support vector machines (SVMs) and regularized optimization. Our brief review charts the importance of these problems and discusses how both connect to the later chapters of this book. We then discuss other themes — applications, formulations, and algorithms — that recur throughout the book, outlining the contents of the various chapters and the relationship between them.

Audience. This book is targeted to a broad audience of researchers and students in the machine learning and optimization communities; but the material covered is widely applicable and should be valuable to researchers in other related areas too. Some chapters have a didactic flavor, covering recent advances at a level accessible to anyone having a passing acquaintance with tools and techniques in linear algebra, real analysis, and probability. Other chapters are more specialized, containing cutting-edge material. We hope that from the wide range of work presented in the book, researchers will gain a broader perspective of the field, and that new connections will be made and new ideas sparked.

For background relevant to the many topics discussed in this book, we refer to the many good textbooks in optimization, machine learning, and related subjects. We mention in particular Bertsekas (1999) and Nocedal and Wright (2006) for optimization over continuous variables, and Ben-Tal et al. (2009) for robust optimization. In machine learning, we refer for background to Vapnik (1999), Schölkopf and Smola (2002), Cristianini and Shawe-Taylor (2000), and Hastie et al. (2009). Some fundamentals of graphical models and the use of optimization therein can be found in Wainwright and Jordan (2008) and Koller and Friedman (2009).

1.1 Support Vector Machines

The support vector machine (SVM) is the first contact that many optimization researchers had with machine learning, due to its classical formulation as a convex quadratic program — simple in form, though with a complicating constraint. It continues to be a fundamental paradigm today, with new algorithms being proposed for difficult variants, especially large-scale and nonlinear variants. Thus, SVMs offer excellent common ground on which to demonstrate the interplay of optimization and machine learning.

1.1.1 Background

The problem is one of learning a classification function from a set of labeled training examples. We denote these examples by $\{(x_i, y_i),\ i = 1, \dots, m\}$, where $x_i \in \mathbb{R}^n$ are feature vectors and $y_i \in \{-1, +1\}$ are the labels. In the simplest case, the classification function is the signum of a linear function of the feature vector. That is, we seek a weight vector $w \in \mathbb{R}^n$ and an intercept $b \in \mathbb{R}$ such that the predicted label of an example with feature vector $x$ is $f(x) = \mathrm{sgn}(w^T x + b)$. The pair $(w, b)$ is chosen to minimize a weighted sum of: (a) a measure of the classification error on the training examples; and (b) $\|w\|_2^2$, for reasons that will be explained in a moment. The formulation is thus
$$
\begin{aligned}
\underset{w,b,\xi}{\text{minimize}} \quad & \tfrac{1}{2} w^T w + C \sum_{i=1}^m \xi_i \\
\text{subject to} \quad & y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad 1 \le i \le m.
\end{aligned}
\tag{1.1}
$$
Note that the summation term in the objective contains a penalty contribution from term $i$ if $y_i = 1$ and $w^T x_i + b < 1$, or $y_i = -1$ and $w^T x_i + b > -1$. If the data are separable, it is possible to find a $(w, b)$ pair for which this penalty is zero. Indeed, it is possible to construct two parallel hyperplanes in $\mathbb{R}^n$, both of them orthogonal to $w$ but with different intercepts, that contain no training points between them. Among all such pairs of planes, the pair for which $\|w\|_2$ is minimal is the one for which the separation is greatest. Hence, this $w$ gives a robust separation between the two labeled sets, and is therefore, in some sense, most desirable. This observation accounts for the presence of the first term in the objective of (1.1).

Problem (1.1) is a convex quadratic program with a simple diagonal Hessian but general constraints. Some algorithms tackle it directly, but for many years it has been more common to work with its dual, which is
$$
\begin{aligned}
\underset{\alpha}{\text{minimize}} \quad & \tfrac{1}{2} \alpha^T Y X^T X Y \alpha - \alpha^T \mathbf{1} \\
\text{subject to} \quad & \textstyle\sum_i y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C,
\end{aligned}
\tag{1.2}
$$
where $Y = \mathrm{Diag}(y_1, \dots, y_m)$ and $X = [x_1, \dots, x_m] \in \mathbb{R}^{n \times m}$. This dual is also a quadratic program. It has a positive semidefinite Hessian and simple bounds, plus a single linear constraint.

More powerful classifiers allow the inputs to come from an arbitrary set $X$, by first mapping the inputs into a space $\mathcal{H}$ via a nonlinear (feature) mapping $\phi : X \to \mathcal{H}$, and then solving the classification problem to find $(w, b)$ with $w \in \mathcal{H}$. The classifier is defined as $f(x) := \mathrm{sgn}(\langle w, \phi(x) \rangle + b)$, and it can be found by modifying the Hessian from $Y X^T X Y$ to $Y K Y$, where $K_{ij} := \langle \phi(x_i), \phi(x_j) \rangle$ is the kernel matrix. The optimal weight vector can be recovered from the dual solution by setting $w = \sum_{i=1}^m \alpha_i \phi(x_i)$, so that the classifier is $f(x) = \mathrm{sgn}\left[\sum_{i=1}^m \alpha_i \langle \phi(x_i), \phi(x) \rangle + b\right]$.

In fact, it is not even necessary to choose the mapping $\phi$ explicitly. We need only define a kernel mapping $k : X \times X \to \mathbb{R}$ and define the matrix $K$ directly from this function by setting $K_{ij} := k(x_i, x_j)$. The classifier can be written purely in terms of the kernel mapping $k$ as follows: $f(x) = \mathrm{sgn}\left[\sum_{i=1}^m \alpha_i k(x_i, x) + b\right]$.
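To make the formulations above concrete, the following minimal sketch poses the primal problem (1.1) with the CVXPY modeling library and builds a Gaussian kernel matrix $K_{ij} = k(x_i, x_j)$. The library choice, the synthetic data, the `predict` helper, and the bandwidth `gamma` are assumptions made purely for illustration; they are not part of the chapter.

```python
# Hedged sketch: soft-margin SVM primal (1.1) via CVXPY, plus a Gaussian
# kernel matrix as in Section 1.1.1. Data and parameters are illustrative.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, C = 40, 2, 1.0
X = rng.normal(size=(m, n))                      # rows are feature vectors x_i
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=m))  # labels in {-1, +1}
y[y == 0] = 1.0

# Primal variables (w, b, xi) of problem (1.1).
w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(m, nonneg=True)

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]   # y_i (w^T x_i + b) >= 1 - xi_i
cp.Problem(objective, constraints).solve()

predict = lambda Xnew: np.sign(Xnew @ w.value + b.value)
print("training accuracy:", np.mean(predict(X) == y))

# Kernel matrix K_ij = k(x_i, x_j) for a Gaussian kernel (bandwidth assumed).
gamma = 0.5
sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-gamma * sq_dists)
```

On such small synthetic data any general-purpose QP solver suffices; the decomposition and first-order methods surveyed next are what make the problem tractable when $m$ is large and $K$ is dense.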

The method nominally requires computation of $|B|$ columns of the kernel $K$ at each iteration, but columns can be saved and reused across iterations. Careful implementation of gradient evaluations leads to further computational savings. In early versions of SVMlight, the reduced QP subproblem was solved with an interior-point method (see below), but this was later changed to a coordinate relaxation procedure due to Hildreth (1957) and D'Esopo (1959). Zanni et al. (2006) use a similar method to select the working set, but solve the reduced problem using nonmonotone gradient projection, with Barzilai-Borwein step lengths. One version of the gradient projection procedure is described by Dai and Fletcher (2006).

Interior-point methods have proved effective on convex quadratic programs in other domains, and have been applied to (1.2) (see Ferris and Munson, 2002; Gertz and Wright, 2003). However, the density, size, and ill-conditioning of the kernel matrix make achieving efficiency difficult. To ameliorate this difficulty, Fine and Scheinberg (2001) propose a method that replaces the Hessian with a low-rank approximation (of the form $V V^T$, where $V \in \mathbb{R}^{m \times r}$ for $r \ll m$) and solves the resulting modified dual. This approach works well on problems of moderate scale, but may be too expensive for larger problems.

In recent years, the usefulness of the primal formulation (1.1) as the basis of algorithms has been revisited. We can rewrite this formulation as an unconstrained minimization involving the sum of a quadratic and a convex piecewise-linear function, as follows:
$$
\underset{w,b}{\text{minimize}} \quad \tfrac{1}{2} w^T w + C R(w, b),
\tag{1.3}
$$
where the penalty term is defined by
$$
R(w, b) := \sum_{i=1}^m \max\bigl(1 - y_i(w^T x_i + b),\, 0\bigr).
\tag{1.4}
$$
Joachims (2006) describes a cutting-plane approach that builds up a convex piecewise-linear lower bounding function for $R(w, b)$ based on subgradient information accumulated at each iterate. Efficient management of the inequalities defining the approximation ensures that subproblems can be solved efficiently, and convergence results are proved. Some enhancements are described in Franc and Sonnenburg (2008), and the approach is extended to nonlinear kernels by Joachims and Yu (2009). Implementations appear in the code SVMperf.³

3. http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html
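The cutting-plane idea rests on the fact that subgradients of the piecewise-linear penalty (1.4) are cheap to assemble. The short NumPy sketch below evaluates the regularized objective (1.3) and one subgradient of it; the function names, the random data, and the choice to take the zero branch at kinks are illustrative assumptions, not code from the chapter.

```python
# Hedged sketch: objective (1.3) and one subgradient of the hinge penalty (1.4).
# NumPy-only; variable names mirror the text, everything else is illustrative.
import numpy as np

def svm_objective(w, b, X, y, C):
    """Value of (1/2) w^T w + C * R(w, b), with R as in (1.4)."""
    margins = 1.0 - y * (X @ w + b)           # 1 - y_i (w^T x_i + b)
    R = np.maximum(margins, 0.0).sum()
    return 0.5 * w @ w + C * R

def svm_subgradient(w, b, X, y, C):
    """One subgradient of (1.3) in (w, b); terms at the kink contribute zero."""
    margins = 1.0 - y * (X @ w + b)
    active = margins > 0.0                    # terms currently contributing to R
    g_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
    g_b = -C * y[active].sum()
    return g_w, g_b

# Tiny usage example on random data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)
w, b, C = np.zeros(3), 0.0, 1.0
print(svm_objective(w, b, X, y, C))
print(svm_subgradient(w, b, X, y, C))
```

A cutting-plane method of the kind described by Joachims accumulates such subgradients to build its lower bounding model of $R$, whereas the stochastic gradient methods discussed next use just one term of the sum per step.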

There has also been recent renewed interest in solving (1.3) by stochastic gradient methods. These appear to have been proposed originally by Bottou (see, for example, Bottou and LeCun, 2004) and are based on taking a step in the $(w, b)$ coordinates, in a direction defined by the subgradient in a single term of the sum in (1.4). Specifically, at iteration $k$, we choose a steplength $\gamma_k$ and an index $i_k \in \{1, 2, \dots, m\}$, and update the estimate of $w$ as follows:
$$
w \leftarrow
\begin{cases}
w - \gamma_k (w - mC\, y_{i_k} x_{i_k}) & \text{if } 1 - y_{i_k}(w^T x_{i_k} + b) > 0, \\
w - \gamma_k w & \text{otherwise.}
\end{cases}
$$
Typically, one uses $\gamma_k = 1/k$. Each iteration is cheap, as it needs to observe just one training point. Thus, many iterations are needed for convergence; but in many large practical problems, approximate solutions that yield classifiers of sufficient accuracy can be found in much less time than is taken by algorithms that aim at an exact solution of (1.1) or (1.2). Implementations of this general approach include SGD⁴ and Pegasos (see Shalev-Shwartz et al., 2007). These methods enjoy a close relationship with stochastic approximation methods for convex minimization; see Nemirovski et al. (2009) and the extensive literature referenced therein. Interestingly, the methods and their convergence theory were developed independently in the two communities, with little intersection until 2009.

4. http://leon.bottou.org/projects/sgd
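A bare-bones version of this update is easy to state in code. The sketch below applies the single-term subgradient step to problem (1.3) with $\gamma_k = 1/k$; the synthetic data, the fixed iteration budget, and the simplification of leaving $b$ at zero (the text gives the update only for $w$) are assumptions for illustration, not the implementations cited above.

```python
# Hedged sketch of the stochastic (sub)gradient update for (1.3):
#   w <- w - gamma_k * (w - m*C*y_ik*x_ik)   if the hinge term i_k is active,
#   w <- w - gamma_k * w                     otherwise.
# Data, iteration count, and keeping b = 0 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, n, C = 200, 5, 0.1
X = rng.normal(size=(m, n))
w_true = rng.normal(size=n)
y = np.sign(X @ w_true + 0.1 * rng.normal(size=m))

w, b = np.zeros(n), 0.0
for k in range(1, 50 * m + 1):
    gamma_k = 1.0 / k                  # steplength gamma_k = 1/k
    i = rng.integers(m)                # observe a single training point i_k
    if 1.0 - y[i] * (X[i] @ w + b) > 0.0:
        w -= gamma_k * (w - m * C * y[i] * X[i])
    else:
        w -= gamma_k * w

accuracy = np.mean(np.sign(X @ w + b) == y)
print(f"training accuracy after SGD: {accuracy:.3f}")
```

Practical codes such as Pegasos and the SGD implementation in footnote 4 add refinements (different steplength schedules, projection or averaging, updates for the intercept), but the per-iteration cost remains a single training example.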

1.1.3 Approaches Discussed in This Book

Several chapters of this book discuss the problem (1.1) or variants thereof. In Chapter 12, Gondzio gives some background on primal-dual interior-point methods for quadratic programming, and shows how structure can be exploited when the Hessian in (1.2) is replaced by an approximation of the form $Q_0 + V V^T$, where $Q_0$ is nonnegative diagonal and $V \in \mathbb{R}^{m \times r}$ with $r \ll m$, as above. The key is careful design of the linear algebra operations that are used to form and solve the linear equations which arise at each iteration of the interior-point method. Andersen et al. in Chapter 3 also consider interior-point methods with low-rank Hessian approximations, but then go on to discuss robust and multiclass variants of (1.1). The robust variants, which replace each training vector $x_i$ with an ellipsoid centered at $x_i$, can be formulated as second-order cone programs and solved with an interior-point method.

A similar model for robust SVM is considered by Caramanis et al. in Chapter 14, along with other variants involving corrupted labels, missing data, nonellipsoidal uncertainty sets, and kernelization. This chapter also explores the connection between robust formulations and the regularization term $\|w\|_2^2$ that appears in (1.1).

As Schmidt et al. note in Chapter 11, omission of the intercept term $b$ from the formulation (1.1) (which can often be done without seriously affecting the quality of the classifier) leads to a dual (1.2) with no equality constraint — it becomes a bound-constrained convex quadratic program. As such, the problem is amenable to solution by gradient projection methods with second-order acceleration on the components of $\alpha$ that satisfy the bounds.

Chapter 13, by Bottou and Bousquet, describes application of SGD to (1.1) and several other machine learning problems. It also places the problem in context by considering other types of errors that arise in its formulation, namely, the errors incurred by restricting the classifier to a finitely parametrized class of functions and by using an empirical, discretized approximation to the objective
