Optimization In Learning And Data Analysis

Transcription

Optimization in Learning and Data Analysis

Stephen Wright
University of Wisconsin-Madison

August 2013

Big Picture

- Optimization provides a powerful toolbox for solving data analysis and learning problems.
- The particular requirements of data analysis problems are driving new research in optimization — much of it being done by machine learning researchers.
- Some old lines of optimization research are suddenly new again!
- Interesting intersections with systems — multicore and clusters.

Outline

I. Sketch some canonical formulations of data analysis / machine learning problems as optimization problems.
II. An optimization toolbox: fundamental formulation and algorithmic techniques from optimization that are featuring strongly in data analysis.
II+I. Show how the optimization tools are mixed and matched to address data analysis tasks.
III. Illustrating new work at the intersection of optimization, systems, and big data: asynchronous multicore algorithms.
- Stochastic gradient (Hogwild!);
- Coordinate descent;
- Application to extreme linear programming.

I. Canonical Formulations

- Linear regression
- Variable selection (LASSO)
- Compressed sensing
- Support vector machines
- Logistic regression
- Matrix completion
- Inverse covariance estimation
- Deep belief networks
- Image processing
- Data assimilation.

Linear Regression

Given a set of feature vectors $a_i \in \mathbb{R}^n$ and outcomes $b_i$, $i = 1, 2, \dots, m$, find weights $x$ that predict the outcome accurately: $a_i^T x \approx b_i$.

Least Squares: Under certain assumptions on measurement error / noise, can find a suitable $x$ by solving a least squares problem
\[ \min_x \frac{1}{2}\|Ax - b\|_2^2 = \frac{1}{2}\sum_{i=1}^m (a_i^T x - b_i)^2, \]
where the rows of $A$ are $a_i^T$, $i = 1, 2, \dots, m$.

Robust Regression: Can replace the sum-of-squares with loss functions that are less sensitive to outliers. Objectives are still separable, one term per data element:
\[ \ell_1: \quad \min_x \|Ax - b\|_1 = \sum_{i=1}^m |a_i^T x - b_i|, \]
\[ \text{Huber:} \quad \min_x \sum_{i=1}^m h(a_i^T x - b_i) \quad \text{($h$ is the Huber loss).} \]
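
To make the least-squares formulation concrete, here is a minimal NumPy sketch (not from the talk; the synthetic sizes and noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10                                   # assumed synthetic sizes
A = rng.standard_normal((m, n))                  # rows are the feature vectors a_i^T
x_true = rng.standard_normal(n)
b = A @ x_true + 0.1 * rng.standard_normal(m)    # noisy outcomes b_i

# Weights minimizing (1/2)||Ax - b||_2^2
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print("least-squares error vs. true weights:", np.linalg.norm(x_ls - x_true))
```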

Feature Selection: LASSO

Can modify least squares for feature selection by adding a LASSO regularizer (Tibshirani, 1996):
\[ \text{LASSO:} \quad \min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1, \]
for some parameter $\lambda > 0$. This identifies an approximate minimizer of the least-squares loss with few nonzeros (sparse).

Can state equivalently in constrained form:
\[ \min_x \frac{1}{2}\|Ax - b\|_2^2 \quad \text{subject to} \quad \|x\|_1 \le T, \]
for parameter $T > 0$.

Compressed Sensing

The $\ell_2$-$\ell_1$ formulation of compressed sensing is identical to LASSO:
\[ \min_x \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1, \]
but the dimensions and properties of $A$ are typically different.
- There is an approximate solution to $Ax \approx b$ that is known to be sparse. ($x$ could be coefficients of a signal in some basis.)
- $A$ is $m \times n$ with $m \ll n$, with narrow column submatrices well conditioned (restricted isometry or incoherence). Usually random.
- $b$ is a vector of observations.

Can recover the sparse $x$ from this convex formulation, despite undersampling $x$. The number of observations $m$ needs to be only a modest multiple of the number of nonzeros in $x$. (Candès et al., 2006)

Support Vector Classification

Given data vectors $x_i \in \mathbb{R}^n$, for $i = 1, 2, \dots, m$, and labels $y_i = \pm 1$ to indicate the class to which $x_i$ belongs.

Seek $z$ such that (usually) we have $x_i^T z \ge 1$ when $y_i = +1$ and $x_i^T z \le -1$ when $y_i = -1$.

SVM with hinge loss to penalize misclassifications. The objective is separable (as in regression):
\[ f(z) = C \sum_{i=1}^m \max\big(1 - y_i(z^T x_i),\, 0\big) + \frac{1}{2}\|z\|^2, \]
where $C > 0$ is a parameter. Define $K_{ij} = y_i y_j x_i^T x_j$ for the dual:
\[ \min_\alpha \frac{1}{2}\alpha^T K \alpha - \mathbf{1}^T \alpha \quad \text{subject to} \quad 0 \le \alpha \le C\mathbf{1}. \]
Extends to a nonlinear kernel: $K_{ij} := y_i y_j k(x_i, x_j)$ for kernel function $k(\cdot,\cdot)$.
(Boser et al., 1992; Vapnik, 1999)
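
A minimal sketch (an illustration assumed here, not code from the talk) of the hinge-loss primal objective and one subgradient, which a subgradient or stochastic-gradient method for this formulation would use:

```python
import numpy as np

def svm_objective(z, X, y, C):
    """Primal SVM objective: C * sum_i max(1 - y_i z^T x_i, 0) + (1/2)||z||^2."""
    margins = 1.0 - y * (X @ z)
    return C * np.maximum(margins, 0.0).sum() + 0.5 * (z @ z)

def svm_subgradient(z, X, y, C):
    """One subgradient of the objective at z; only rows with margin violations contribute."""
    violated = (1.0 - y * (X @ z)) > 0.0
    return -C * (X[violated].T @ y[violated]) + z
```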

(Regularized) Logistic Regression

Seek an odds function parametrized by $z \in \mathbb{R}^n$:
\[ p_+(x; z) := (1 + e^{-z^T x})^{-1}, \qquad p_-(x; z) := 1 - p_+(x; z), \]
choosing $z$ so that $p_+(x_i; z) \approx 1$ when $y_i = +1$ and $p_-(x_i; z) \approx 1$ when $y_i = -1$. The scaled, negative log likelihood function $L(z)$ is
\[ L(z) = -\frac{1}{m}\Big[ \sum_{y_i = -1} \log p_-(x_i; z) + \sum_{y_i = +1} \log p_+(x_i; z) \Big]. \]
Add a regularizer $\lambda\|z\|_1$ to select features.

$M$ classes: $y_{ij} = 1$ if data point $i$ is in class $j$; $y_{ij} = 0$ otherwise. $z_{[j]}$ is the subvector of $z$ for class $j$. Then
\[ f(z) = -\frac{1}{N} \sum_{i=1}^N \Big[ \sum_{j=1}^M y_{ij}\,(z_{[j]}^T x_i) - \log \sum_{j=1}^M \exp(z_{[j]}^T x_i) \Big]. \]
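
As an illustrative sketch (assumed, not from the slides), the binary negative log likelihood $L(z)$ and its gradient in a numerically stable form; an $\ell_1$ term would be handled separately by the shrink operators of Part II:

```python
import numpy as np

def logistic_nll(z, X, y):
    """L(z) = -(1/m) sum_i log p_{y_i}(x_i; z), labels y_i in {-1, +1}.
    Uses -log p_{y_i}(x_i; z) = log(1 + exp(-y_i z^T x_i))."""
    t = y * (X @ z)
    return np.mean(np.logaddexp(0.0, -t))

def logistic_grad(z, X, y):
    """Gradient of L(z): -(1/m) sum_i y_i x_i * sigmoid(-y_i z^T x_i)."""
    t = y * (X @ z)
    s = 0.5 * (1.0 - np.tanh(0.5 * t))   # sigmoid(-t), written stably
    return -(X.T @ (y * s)) / X.shape[0]
```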

Matrix Completion

Seek a matrix $X \in \mathbb{R}^{m \times n}$ with some desired structure (e.g. low rank) that matches certain observations, possibly noisy:
\[ \min_X \frac{1}{2}\|\mathcal{A}(X) - b\|_2^2 + \lambda\,\psi(X), \]
where $\mathcal{A}(X)$ is a linear mapping of the components of $X$ (e.g. observations of certain elements of $X$).

Setting $\psi$ to be the nuclear norm (sum of singular values) promotes low rank (in the same way as $\|x\|_1$ tends to promote sparsity of a vector $x$).

Can impose other structures, e.g. $X$ is the sum of a sparse matrix and a low-rank matrix. (The element-wise 1-norm $\|X\|_1$ is useful for sparsity.)

Used in recommender systems, e.g. Netflix, Amazon. (Recht et al., 2010)
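
The shrink (prox) operator of the nuclear norm is singular-value thresholding, which many algorithms for this formulation use (e.g. Cai et al., 2008, cited later in the matching section). A minimal sketch, assuming a dense matrix for simplicity:

```python
import numpy as np

def nuclear_norm_shrink(X, tau):
    """Prox of tau * (nuclear norm): soft-threshold the singular values by tau.
    Shrinking small singular values to zero promotes low rank."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```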

Inverse Covariance Estimation

Given $m$ samples $y_1, y_2, \dots, y_m$ of a Gaussian random variable $Y \sim N(\mu, C)$, the log-likelihood of the inverse covariance $P$ is
\[ L(P) = \log p(y_1, \dots, y_m \mid P) = \log\det(P) - \langle S, P\rangle + \text{constant}, \]
where $S = \frac{1}{m}\sum_{i=1}^m (y_i - \mu)(y_i - \mu)^T$ is the sample covariance.

Zeros in $P$ reveal conditional independencies between components of $Y$:
\[ P_{ij} = 0 \iff Y_i \perp Y_j \mid \{Y_k,\ k \ne i, j\}. \]
Sparsity in $P$ is encouraged by adding an element-wise $\ell_1$ regularizer:
\[ \min_{P \succ 0}\ -\log\det(P) + \langle S, P\rangle + \tau\|P\|_1. \]
Sparsity reveals only the important dependencies, i.e. the structure of the network. (Friedman et al., 2008)

Deep Belief Networks

Deep belief nets / neural nets transform feature vectors prior to classification.

[Figure] Example of a deep belief network for autoencoding (Hinton, 2007). The output (at top) depends on the input (at bottom) of an image with 28 × 28 pixels. The unknowns are the parameters of the matrices $W_1, W_2, W_3, W_4$; the output is nonlinear in these parameters.

Deep Belief Networks

The output of a DBN can form the input to a classifier (e.g. an SVM, or something simpler, like a max of the output features).

Objectives in learning problems based on DBNs are
- separable: the objective is composed of terms that each depend on one item of data (e.g. one utterance, one character, one image) and possibly its neighbors in space or time;
- nonlinear, nonconvex: each layer is simple (linear transformation, sigmoid, softmax), but their composition is not;
- possibly regularized with terms that impose structure, e.g. phoneme class depends on sounds that came before and after.

Image Processing

Natural images are not random! They tend to have large areas of near-constant intensity or color, separated by sharp edges.

Denoising / deblurring: Given an image with noise or blur, seek a "nearby natural image."

Total Variation Regularization

Apply an $\ell_1$ penalty to spatial gradients in the 2D image, defined by
\[ u : \Omega \to \mathbb{R}, \qquad \Omega := [0,1] \times [0,1]. \]
Given a noisy image $f : \Omega \to \mathbb{R}$, solve for $u$ (Rudin et al., 1992):
\[ \min_u \int_\Omega (u(x) - f(x))^2\,dx + \lambda \int_\Omega \|\nabla u(x)\|_2\,dx. \]

Data Assimilation

There are thriving communities in computational science that study PDE-constrained optimization and, in particular, data assimilation. The latter is the basis of weather forecasting.

These are based on parametrized partial differential equation models, whose parameters are determined from
- data (huge, heterogeneous): observations of the system state at different points in space and time;
- statistical models of noise, in both the PDE model and the observations;
- prior knowledge about the solution, such as a guess of the optimal value and an estimate of its reliability.

Needs models (meteorology and oceanography), statistics, optimization, scientific computing, physics, applied math, ...

There is active research on better noise models (better covariances).

II. Optimization Formulations: Typical Properties

- Data, from which we want to extract key information, make inferences about future / missing data, or guide decisions.
- A parametrized model that captures the relationship between the data and the meaning we are trying to extract.
- An objective that measures the mismatch between the current model / parameters and the observed data; also deviation from prior knowledge or desired structure.

Specific typical properties of learning problems are
- big data;
- often, need only low-to-medium accuracy;
- have some prior knowledge about the model parameters;
- have some desired structure for the parameters: regularization.

In some cases, the optimization formulation is well settled: see above. In other areas, the formulation is a matter of ongoing debate!

Optimization Toolbox

A selection of fundamental optimization techniques that feature strongly in the applications above. Most have a long history, but the slew of interesting new applications and contexts has led to new twists and better understanding.

- Accelerated gradient (and its cousins)
- Stochastic gradient
- Coordinate descent
- Shrinking techniques for regularized formulations
- Higher-order methods
- Augmented Lagrangians, splitting, ADMM.

Describe each briefly, then show how they are deployed to solve the applications in Part I.

Gradient Methods: Steepest Descent

$\min f(x)$, with smooth convex $f$. First-order methods calculate $\nabla f(x_k)$ at each iteration, and do something with it.

Compare these methods on the smooth convex case:
\[ \mu I \preceq \nabla^2 f(x) \preceq L I \quad \text{for all } x \qquad (0 \le \mu \le L). \]
Steepest descent sets
\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad \text{for some } \alpha_k > 0. \]
When $\mu > 0$, set $\alpha_k \equiv 2/(\mu + L)$ to get linear convergence, at a rate depending on the conditioning $\kappa := L/\mu$:
\[ f(x_k) - f(x^*) \le \frac{L}{2}\left(1 - \frac{2}{\kappa + 1}\right)^{2k} \|x_0 - x^*\|^2. \]
Need $O(\kappa \log \epsilon)$ iterations to reduce the error by a factor $\epsilon$.

We can't improve much on these rates by using more sophisticated choices of $\alpha_k$ — they're a fundamental limitation of searching along $-\nabla f(x_k)$.
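
A minimal sketch of short-step steepest descent with the constant step $2/(\mu + L)$ from this slide, run on an assumed strongly convex quadratic test problem (not from the talk):

```python
import numpy as np

def steepest_descent(grad_f, x0, mu, L, iters=500):
    """Constant-step steepest descent: x_{k+1} = x_k - alpha * grad f(x_k), alpha = 2/(mu + L)."""
    x, alpha = x0.copy(), 2.0 / (mu + L)
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

# Example: f(x) = (1/2) x^T Q x with eigenvalues in [mu, L], so grad f(x) = Q x and x* = 0.
mu, L = 1.0, 100.0
Q = np.diag(np.linspace(mu, L, 50))
x = steepest_descent(lambda v: Q @ v, np.ones(50), mu, L)
print("final error:", np.linalg.norm(x))
```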

Momentum!

First-order methods can be improved dramatically using momentum:
\[ x_{k+1} = x_k - \alpha_k \nabla f(x_k) + \beta_k (x_k - x_{k-1}). \]
The search direction is a combination of the previous search direction $x_k - x_{k-1}$ and the latest gradient $\nabla f(x_k)$. Methods in this class include: Heavy-Ball, Conjugate Gradient, Accelerated Gradient.

Heavy-ball sets
\[ \alpha_k \equiv \frac{4}{L}\,\frac{1}{(1 + 1/\sqrt{\kappa})^2}, \qquad \beta_k \equiv \left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^2, \]
to get a linear convergence rate with constant approximately $1 - 2/\sqrt{\kappa}$. Thus requires about $O(\sqrt{\kappa}\log\epsilon)$ iterations to achieve precision $\epsilon$, vs. about $O(\kappa\log\epsilon)$ for steepest descent. (Polyak, 1987)

For $\kappa = 100$, heavy-ball is 10 times faster than steepest descent.
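
The same assumed quadratic example with the heavy-ball constants from this slide (a sketch, not the talk's code); with $\kappa = 100$ it converges markedly faster than the steepest-descent loop above:

```python
import numpy as np

def heavy_ball(grad_f, x0, mu, L, iters=500):
    """Heavy-ball: x_{k+1} = x_k - alpha * grad f(x_k) + beta * (x_k - x_{k-1}),
    with alpha = (4/L) / (1 + 1/sqrt(kappa))^2 and beta = (1 - 2/(sqrt(kappa)+1))^2."""
    kappa = L / mu
    alpha = (4.0 / L) / (1.0 + 1.0 / np.sqrt(kappa)) ** 2
    beta = (1.0 - 2.0 / (np.sqrt(kappa) + 1.0)) ** 2
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = x - alpha * grad_f(x) + beta * (x - x_prev), x
    return x

mu, L = 1.0, 100.0
Q = np.diag(np.linspace(mu, L, 50))
print("final error:", np.linalg.norm(heavy_ball(lambda v: Q @ v, np.ones(50), mu, L)))
```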

Accelerated Gradient Methods

Accelerate the rate to $1/k^2$ for weakly convex $f$, while retaining the linear rate (based on $\sqrt{\kappa}$) for the strongly convex case.

One of Nesterov's methods (Nesterov, 1983, 2004) is:

0: Choose $x_0$, $\alpha_0 \in (0,1)$; set $y_0 \leftarrow x_0$.
k: $x_{k+1} \leftarrow y_k - \frac{1}{L}\nabla f(y_k)$;  (*short-step gradient*)
   solve for $\alpha_{k+1} \in (0,1)$: $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2 + \alpha_{k+1}/\kappa$;
   set $\beta_k = \alpha_k(1 - \alpha_k)/(\alpha_k^2 + \alpha_{k+1})$;
   set $y_{k+1} \leftarrow x_{k+1} + \beta_k(x_{k+1} - x_k)$.  (*update with momentum*)

Separates the "steepest descent" contribution from the "momentum" contribution, producing two sequences $\{x_k\}$ and $\{y_k\}$.

Still works for weakly convex ($\kappa = \infty$).

FISTA (Beck and Teboulle, 2009) is similar. Extends easily to problems with convex constraints, regularization, etc.
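
A sketch of the recursion on this slide (the quadratic that defines $\alpha_{k+1}$ is rearranged into its positive root; $\alpha_0 = 0.5$ is an arbitrary choice, and the test problem is assumed):

```python
import numpy as np

def nesterov_agd(grad_f, x0, L, kappa, iters=500):
    """Nesterov's accelerated gradient, strongly convex variant, as on the slide."""
    x, y, alpha = x0.copy(), x0.copy(), 0.5          # any alpha_0 in (0, 1)
    for _ in range(iters):
        x_new = y - grad_f(y) / L                    # short-step gradient
        # solve alpha_new^2 = (1 - alpha_new) * alpha^2 + alpha_new / kappa
        c = alpha**2 - 1.0 / kappa
        alpha_new = 0.5 * (-c + np.sqrt(c**2 + 4.0 * alpha**2))
        beta = alpha * (1.0 - alpha) / (alpha**2 + alpha_new)
        y = x_new + beta * (x_new - x)               # update with momentum
        x, alpha = x_new, alpha_new
    return x

mu, L = 1.0, 100.0
Q = np.diag(np.linspace(mu, L, 50))
print("final error:", np.linalg.norm(nesterov_agd(lambda v: Q @ v, np.ones(50), L, L / mu)))
```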

[Figure: the interleaved accelerated-gradient iterates $x_{k-2}, y_{k-2}, x_{k-1}, y_{k-1}, x_k, y_k$.]

Stochastic Gradient

Still deal with (weakly or strongly) convex $f$. But change the rules:
- Allow $f$ nonsmooth.
- Don't calculate function values $f(x)$.
- Can evaluate cheaply an unbiased estimate of a vector from the subgradient $\partial f$.

Consider the finite sum:
\[ f(x) = \frac{1}{m}\sum_{i=1}^m f_i(x), \]
where each $f_i$ is convex and $m$ is huge. Often, each $f_i$ is a loss function associated with the $i$th data item (SVM, regression, ...), or a mini-batch.

Classical SG: Choose index $i_k \in \{1, 2, \dots, m\}$ uniformly at random at iteration $k$, and set
\[ x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k), \]
for some steplength $\alpha_k > 0$. (Robbins and Monro, 1951)
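
A sketch of classical SG for the least-squares sum $f(x) = \frac{1}{m}\sum_i f_i(x)$ with $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$, using the $\alpha_k = 1/(k\mu)$ schedule analyzed on the next slide (the test problem is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 2000, 20
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.01 * rng.standard_normal(m)

mu = np.linalg.eigvalsh(A.T @ A / m).min()       # strong convexity modulus of f
x = np.zeros(n)
for k in range(1, 20 * m + 1):
    i = rng.integers(m)                          # sample one term f_i
    g = (A[i] @ x - b[i]) * A[i]                 # grad f_i at the current iterate
    x -= g / (k * mu)                            # alpha_k = 1 / (k * mu)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print("distance to least-squares solution:", np.linalg.norm(x - x_star))
```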

Classical SG

Suppose $f$ is strongly convex with modulus $\mu$, and there is a bound $M$ on the size of the gradient estimates:
\[ \frac{1}{m}\sum_{i=1}^m \|\nabla f_i(x)\|^2 \le M^2 \]
for all $x$ of interest. Convergence is obtained for the expected square error
\[ a_k := \frac{1}{2} E(\|x_k - x^*\|^2). \]
An elementary argument shows a recurrence:
\[ a_{k+1} \le (1 - 2\mu\alpha_k)a_k + \frac{1}{2}\alpha_k^2 M^2. \]
When we set $\alpha_k = 1/(k\mu)$, a neat inductive argument reveals a $1/k$ rate:
\[ a_k \le \frac{Q}{2k}, \qquad \text{for } Q := \max\left(\|x_1 - x^*\|^2, \frac{M^2}{\mu^2}\right). \]

SG Variants

Constant Stepsize. Set a target for $a_k$, and figure out how to choose a constant step $\alpha_k \equiv \alpha$ and number of iterations $K$ to hit this target.

Robust SG: Primal Averaging. Choose $\alpha_k = \theta/(M\sqrt{k})$ (for some parameter $\theta$), generate $x_k$ as above, and form a weighted average of all iterates:
\[ \bar{x}_k = \frac{\sum_{i=1}^k \alpha_i x_i}{\sum_{i=1}^k \alpha_i}. \]
Convergence can be slower, but more stable than the classical approach. Also works for $\mu = 0$, and is less sensitive to estimates of parameters.

Dual Averaging. Take a step based on the average of all gradient estimates seen so far. Again, more stable behavior.

SG is not a descent method, so please don't call it SGD!

Coordinate Descent

Again consider unconstrained minimization for smooth $f : \mathbb{R}^n \to \mathbb{R}$:
\[ \min_{x \in \mathbb{R}^n} f(x). \]
Iteration $k$ of coordinate descent (CD) picks one index $i_k$ and takes a step in the $i_k$ component of $x$ to decrease $f$. It may be proportional to the gradient w.r.t. $x_{i_k}$:
\[ x_{k+1, i_k} = x_{k, i_k} - \alpha\,[\nabla f(x_k)]_{i_k}, \]
or we may actually minimize $f$ along the $i_k$ direction.
- Deterministic CD: choose $i_k$ in some fixed order, e.g. cyclic;
- Stochastic CD: choose $i_k$ at random from $\{1, 2, \dots, n\}$.

CD is a reasonable choice when it's cheap to evaluate individual elements of $\nabla f(x)$ (at $1/n$ of the cost of a full gradient, say).
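
A sketch (assumed example, not from the talk) of stochastic CD for $f(x) = \frac{1}{2}\|Ax - b\|^2$, where $[\nabla f(x)]_i = A_{:,i}^T(Ax - b)$; maintaining the residual $r = Ax - b$ keeps each coordinate step at $O(m)$ cost instead of the $O(mn)$ cost of a full gradient:

```python
import numpy as np

def stochastic_cd(A, b, iters=20000, seed=0):
    """Stochastic coordinate descent for f(x) = (1/2)||Ax - b||^2,
    with exact minimization along each chosen coordinate."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x, r = np.zeros(n), -b.copy()                # residual r = Ax - b with x = 0
    col_norms = (A * A).sum(axis=0)              # per-coordinate curvature ||A_{:,i}||^2
    for _ in range(iters):
        i = rng.integers(n)
        g = A[:, i] @ r                          # i-th component of the gradient
        step = g / col_norms[i]                  # exact line minimization along coordinate i
        x[i] -= step
        r -= step * A[:, i]                      # keep residual consistent with x
    return x
```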

Coordinate Descent Illustrated

[Figure.]

Coordinate Descent: Extensions and Convergence

Block variants of CD choose a subset $[k] \subset \{1, 2, \dots, n\}$ of components at iteration $k$, and take a step in those.

Can also apply coordinate descent when there are bounds on components of $x$, or, more generally, constraints that are separable with respect to the blocks in a block CD method. Similar extensions to separable regularization functions (see below).

Convergence: Deterministic (Luo and Tseng, 1992; Tseng, 2001), linear rate (Beck and Tetruashvili, 2013). Stochastic, linear rate: (Nesterov, 2012).

Regularization

Often have prior knowledge of structure in the solution:
- Most famously: sparsity of the unknown vector $x$ (few nonzeros).
- Low-rank and/or sparsity of an unknown matrix $X$.
- "Naturalness" of an image vector $u_{ij}$, for $i, j = 1, 2, \dots, N$. (Large areas of constant intensity/color separated by sharp edges.)
- Group sparsity: nonzeros in $x$ appear in predefined clusters, arising for example as subtrees of a tree.

Enforcing these requirements in the obvious way gives rise to intractable optimization problems. But often, we can add regularization functions to the objective or constraints to obtain convex formulations whose solutions have the desired sparsity.

$\ell_1$ and Sparsity

It had been known for some time (since the 1970s?) that adding an $\ell_1$ norm in the formulation of $\min_x f(x)$ tended to produce sparse solutions:
\[ \min_x f(x) + \lambda\|x\|_1, \qquad \min_x f(x) \ \text{ subject to } \ \|x\|_1 \le T. \]
When $f$ is a convex function, these formulations are also convex.

Candès et al. (2006) showed that when $f(x) = (1/2)\|Ax - b\|_2^2$, the convex $\ell_1$ formulation has the same solution $x$ as the (intractable) cardinality-constrained formulation: compressed sensing.

Requires certain conditions on $A$ (e.g. restricted isometry; see above). The number of observations (rows of $A$) needed to recover $x$ is related to the number of nonzeros — it depends only logarithmically on the full dimension of $x$.

Other Structures

For other types of structure, the trick becomes to define a regularization function $\psi(x)$ that induces the desired structure. Many of these $\psi$ are obvious generalizations of $\ell_1$. Examples:
- Low rank of matrix $X$: $\psi(X) = \|X\|_* =$ sum of singular values of $X$ (nuclear norm).
- Sparse plus low-rank $X$: split $X = S + T$ and use regularizer $\|T\|_* + \|S\|_1$, where $\|S\|_1$ is the element-wise 1-norm.
- Natural image: total variation:
\[ \|u\|_{TV} := \sum_{i=1}^{N-1}\sum_{j=1}^{N-1} \left\| \begin{pmatrix} u_{i+1,j} - u_{i,j} \\ u_{i,j+1} - u_{i,j} \end{pmatrix} \right\|. \]
- Group sparse: sum of $\ell_2$ norms:
\[ \psi(x) = \sum_{j=1}^m \|x_{[j]}\|_2, \]
where the $[j] \subset \{1, 2, \dots, n\}$ are the groups of components.

Shrink Operators

The regularizers $\psi$ are usually simple, convex, nonsmooth, and separable. Shrink operators are useful in extending algorithms from the smooth setting
\[ \min_x f(x) \]
to the regularized setting
\[ \min_x f(x) + \lambda\psi(x). \]
In the smooth setting, the step is $x \leftarrow x - \alpha g$, which is the solution of
\[ x \leftarrow \arg\min_z \frac{1}{2}\|z - (x - \alpha g)\|_2^2. \]
Can extend this to the regularized case by simply adding $\psi$: Shrink!
\[ x \leftarrow \arg\min_z \frac{1}{2}\|z - (x - \alpha g)\|_2^2 + \alpha\lambda\psi(z) =: S_{\alpha\lambda}(x - \alpha g). \]
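
For $\psi(x) = \|x\|_1$ the shrink has the familiar closed form (componentwise soft thresholding). A minimal sketch, together with one shrink (prox-gradient, ISTA-style) step for the LASSO objective of Part I; the example is an illustration, not code from the talk:

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink operator S_t for psi = ||.||_1: componentwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_step(x, A, b, lam, alpha):
    """One shrink step for min_x (1/2)||Ax - b||^2 + lam * ||x||_1:
    x <- S_{alpha*lam}(x - alpha * grad f(x))."""
    g = A.T @ (A @ x - b)
    return soft_threshold(x - alpha * g, alpha * lam)
```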

Using Shrinks

For many regularizers $\psi$, the shrink operation is cheap: $O(n)$ operations. Often, can even solve it in closed form.

Constraints can also be enforced via shrinks. For
\[ \min_{x \in \Omega} f(x), \quad \text{with } \Omega \text{ closed and convex}, \]
can define
\[ \psi(x) = i_\Omega(x) = \begin{cases} 0 & \text{if } x \in \Omega, \\ \infty & \text{otherwise.} \end{cases} \]
The shrink operator becomes a projection onto $\Omega$.

Steepest descent, accelerated gradient, stochastic gradient, and higher-order methods can all be extended to the regularized case by replacing the line step with a shrink operation.

Newton's Method

Higher-order methods are founded in Newton's method, which takes the step to be the minimizer of a second-order approximation to a smooth function $f$:
\[ d = \arg\min_d\ f(x) + \nabla f(x)^T d + \frac{1}{2} d^T \nabla^2 f(x)\,d, \]
and sets $x \leftarrow x + d$. When $\nabla^2 f(x)$ is positive definite, can compute $d$ as the solution of
\[ \nabla^2 f(x)\,d = -\nabla f(x). \]
The usual drawback is the difficulty of computing $\nabla^2 f(x)$. But there are many variants of Newton for which this is not necessary.
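
A minimal sketch of the pure Newton iteration for a smooth $f$, assuming the Hessian is available and positive definite (the example problem is an assumption for illustration):

```python
import numpy as np

def newton(grad_f, hess_f, x0, iters=20, tol=1e-10):
    """Pure Newton's method: solve hess_f(x) d = -grad_f(x), then x <- x + d."""
    x = x0.copy()
    for _ in range(iters):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess_f(x), -g)       # Newton step
        x = x + d
    return x

# Example: minimize f(x) = sum_i (exp(x_i) - x_i), whose minimizer is x = 0.
f_grad = lambda x: np.exp(x) - 1.0
f_hess = lambda x: np.diag(np.exp(x))
print(newton(f_grad, f_hess, np.full(3, 2.0)))
```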

Higher-Order Methods

Modify Newton's method in various ways:
- quasi-Newton (e.g. L-BFGS): use gradient information to maintain an approximation to $\nabla^2 f(x)$;
- inexact Newton: use an iterative method for linear equations to solve for $d$. Requires only matrix-vector multiplications with $\nabla^2 f(x)$, which can be approximated by finite differences;
- approximate Newton: compute a cheap approximation to $\nabla^2 f(x)$, perhaps by sampling (Byrd et al., 2011):
\[ \nabla^2 f(x) \approx \sum_{i \in S} \nabla^2 f_i(x); \]
- take higher-order steps only on a reduced space. Example: for sparse $x$ ($\ell_1$ regularization), take a reduced Newton step on the nonzero components of $x$ — once these are identified.

Augmented Lagrangian

Consider the linearly constrained problem:
\[ \min f(x) \quad \text{s.t.} \quad Ax = b. \]
The augmented Lagrangian is
\[ \mathcal{L}(x, \lambda; \rho) := f(x) + \lambda^T(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2, \]
where $\rho > 0$. The basic augmented Lagrangian / method of multipliers is
\[ x_k = \arg\min_x \mathcal{L}(x, \lambda_{k-1}; \rho_k); \qquad \lambda_k = \lambda_{k-1} + \rho_k(Ax_k - b); \qquad \text{(choose } \rho_{k+1}\text{)}. \]
Extends in a straightforward way to inequality and nonlinear constraints.

Dates to 1969: Hestenes, Powell; Rockafellar, Bertsekas, Conn / Gould / Toint.

ADMM

The Alternating Direction Method of Multipliers (ADMM) exploits a partitioning of the objective and/or constraints. Given
\[ \min_{(x,z)} f(x) + h(z) \quad \text{subject to} \quad Ax + Bz = c, \]
we have the augmented Lagrangian
\[ \mathcal{L}(x, z, \lambda; \rho) := f(x) + h(z) + \lambda^T(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2. \]
Minimize over $x$ and $z$ separately — not jointly, as standard augmented Lagrangian would require:
\[ x_k = \arg\min_x \mathcal{L}(x, z_{k-1}, \lambda_{k-1}; \rho_k); \]
\[ z_k = \arg\min_z \mathcal{L}(x_k, z, \lambda_{k-1}; \rho_k); \]
\[ \lambda_k = \lambda_{k-1} + \rho_k(Ax_k + Bz_k - c). \]
(Useful when the minimizations w.r.t. $x$ and $z$ are much easier than joint minimization.)
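
A sketch of ADMM applied to the LASSO of Part I, taking $f(x) = \frac{1}{2}\|Ax - b\|^2$, $h(z) = \lambda\|z\|_1$, and the constraint $x - z = 0$ (so the $A$, $B$ of the slide are $I$, $-I$). The $x$-step is a linear solve and the $z$-step is the $\ell_1$ shrink; this mirrors the standard scaled-multiplier treatment (Boyd et al., 2011), not code from the talk:

```python
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM for min (1/2)||Ax - b||^2 + lam*||z||_1  s.t.  x - z = 0."""
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)   # u = lambda / rho (scaled multiplier)
    M = A.T @ A + rho * np.eye(n)                     # x-step matrix (factor once in practice)
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(M, Atb + rho * (z - u))                        # x-minimization
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)    # z-step: shrink
        u = u + x - z                                                      # multiplier update
    return z
```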

ADMM

Many recent applications to compressed sensing, image processing, matrix completion, sparse principal components analysis, etc.
(Eckstein and Bertsekas, 1992; Boyd et al., 2011)

The surge of interest in ADMM is clear from the citation index for Eckstein and Bertsekas' 1992 paper:

[Figure: citation counts by year for Eckstein and Bertsekas (1992).]

ADMM for Awkward Intersections

\[ \min f(x) \quad \text{s.t.} \quad x \in \Omega_1 \cap \Omega_2 \cap \dots \cap \Omega_m. \]
Reformulate with a master variable $x$ and copies $x^1, x^2, \dots, x^m$:
\[ \min_{x, x^1, x^2, \dots, x^m} f(x) \quad \text{s.t.} \quad x^i \in \Omega_i, \quad x^i - x = 0, \quad i = 1, 2, \dots, m. \]
The minimizations over the $\Omega_i$ can be done independently:
\[ x_k^i = \arg\min_{x^i \in \Omega_i}\ (\lambda_{k-1}^i)^T(x^i - x_{k-1}) + \frac{\rho_k}{2}\|x_{k-1} - x^i\|_2^2, \quad i = 1, 2, \dots, m. \]
Optimize over the master variable (unconstrained):
\[ x_k = \arg\min_x\ f(x) + \sum_{i=1}^m \Big[ -(\lambda_{k-1}^i)^T(x - x_k^i) + \frac{\rho_k}{2}\|x - x_k^i\|_2^2 \Big]. \]
Update the multipliers:
\[ \lambda_k^i = \lambda_{k-1}^i + \rho_k(x_k - x_k^i), \quad i = 1, 2, \dots, m. \]

II+I: Matching Tools to Applications

Return to the applications in Part I and mention how the optimization tools of Part II have been used to solve them. The tools are often combined in different ways.

Linear Regression.
- Linear algebra for $\|\cdot\|_2$. (Traditional!)
- Stochastic gradient for $m \gg n$ (e.g. the parallel version described in Raghu's keynote yesterday).

Variable Selection & Compressed Sensing.
- Shrink algorithms (for the $\ell_1$ term) (Wright et al., 2009).
- Accelerated gradient (Beck and Teboulle, 2009).
- ADMM (Zhang et al., 2010).
- Higher-order: reduced inexact Newton (Wen et al., 2010); interior-point (Fountoulakis and Gondzio, 2013).
- (Also homotopy in $\lambda$, LARS, ...) (Efron et al., 2004)

Support Vector Machines.
- Coordinate descent (Platt, 1999; Chang and Lin, 2011).
- Stochastic gradient (Bottou and LeCun, 2004; Shalev-Shwartz et al., 2007).
- Higher-order methods: interior-point (Ferris and Munson, 2002; Fine and Scheinberg, 2001); on reduced space (Joachims, 1999).
- Shrink algorithms (Duchi and Singer, 2009; Xiao, 2010).
- Stochastic gradient + shrink + higher-order (Lee and Wright, 2012).

Logistic Regression (+ $\ell_1$ Regularization).
- Shrink algorithms + reduced Newton (Shevade and Keerthi, 2003; Shi et al., 2008).
- Newton (Lin et al., 2008; Lee et al., 2006).
- Stochastic gradient (many!).
- Coordinate descent (Meier et al., 2008).

Matrix Completion.
- (Block) coordinate descent (Wen et al., 2012).
- Shrink (Cai et al., 2008; Lee et al., 2010).
- Stochastic gradient (Lee et al., 2010).

Inverse Covariance.
- Coordinate descent (Friedman et al., 2008).
- Accelerated gradient (d'Aspremont et al., 2008).
- ADMM (Goldfarb et al., 2012; Scheinberg and Ma, 2012).

Deep Belief Networks.
- Stochastic gradient (Le et al., 2012).
- Higher-order (L-BFGS, approximate Newton) (Martens, 2010).
- Shrinks.
- Coordinate descent (pretraining) (Hinton et al., 2006).

Image Processing.
- Shrink algorithms, gradient projection (Figueiredo and Nowak, 2003; Zhu et al., 2010).
- Higher-order methods: interior-point (Chan et al., 1999), reduced Newton.
- Augmented Lagrangian and ADMM (Bregman) (Yin et al., 2008).

Data Assimilation.
- Higher-order methods (L-BFGS, inexact Newton) + many other tools from scientific computing.

III. Multicore Asynchronous Methods

Stochastic gradient, coordinate descent: aren't these slow, simple, old algorithms?
- Often good for learning / big-data applications — can make progress using a small subset of data.
- Slow? Often a good fit for modern computers (multicore, NUMA, clusters) — parallel, asynchronous versions are possible.
- Simple? What's wrong with that? There's interesting new analysis, tied to plausible models of parallel computation and data.
- Old? Yes, but now they are retooled for asynchronous implementation — new computational model and new analysis.

"Asynchronicity is the key to speedup on modern architectures," says Bill Gropp (SIAM CS&E Plenary, 2013).

Parallel Stochastic Gradient

Recall the objective $f(x) = (1/m)\sum_i f_i(x)$, and the basic (serial) SG step:
\[ x_{k+1} = x_k - \alpha_k \nabla f_{i_k}(x_k), \]
where $i_k \in \{1, 2, \dots, m\}$ is chosen at random. We consider a constant-step variant with $\alpha_k \equiv \alpha$.

Parallel versions tried:
- Dual Averaging (AIG): average gradient estimates evaluated in parallel on different cores. Requires message passing / synchronization (Dekel et al., 2012; Duchi et al., 2010).
- Round-Robin (RR): cores evaluate $\nabla f_i$ in parallel and update the centrally stored $x$ in round-robin fashion. Requires synchronization (Langford et al., 2009).
- Asynchronous: Hogwild!: each core grabs the centrally stored $x$ and evaluates $\nabla f_i(x)$ for some random $i$, then writes the updates back into $x$ (Niu et al., 2011). Downpour SGD: similar idea for clusters (Le et al., 2012).

Asynchronous Stochastic Gradient: Hogwild!

Hogwild!: Each processor runs independently (without synchronization):
1. Sample $i_k$ from $\{1, 2, \dots, m\}$;
2. Read the current state of $x$ from central memory, evaluate $g := \nabla f_{i_k}(x)$;
3. For the nonzero components $g_v$, do $x_v \leftarrow x_v - \alpha g_v$.

- Updates can be "old" by the time they are applied, but we assume that they are at most $\tau$ cycles old.
- Processors can overwrite each other's work, but sparsity of the $\nabla f_i$ helps — updates don't interfere too much with each other.

Define quantities that capture the interconnectedness of the functions $f_i$:
- $\rho_i$ = number of indices $j$ such that $f_i$ and $f_j$ depend on overlapping components of $x$;
- $\bar\rho = \sum_{i=1}^m \rho_i / m^2$: average rate of overlapping subvectors.
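
An illustrative sketch of the Hogwild! update pattern using Python threads (an assumption for illustration only: CPython's GIL serializes bytecode, so this shows the lock-free, touched-components-only update scheme rather than real multicore speedup):

```python
import numpy as np
import threading

def hogwild_sgd(A, b, alpha=1e-3, sweeps=10, n_threads=4):
    """Hogwild!-style asynchronous SG for f(x) = (1/2m) sum_i (a_i^T x - b_i)^2.
    Threads update the shared iterate x in place, without locks."""
    m, n = A.shape
    x = np.zeros(n)                          # shared iterate in "central memory"

    def worker(num_updates):
        rng = np.random.default_rng()
        for _ in range(num_updates):
            i = rng.integers(m)              # sample a data item f_i
            a_i = A[i]
            g = (a_i @ x - b[i]) * a_i       # gradient of f_i at the (possibly stale) x
            nz = np.nonzero(a_i)[0]          # only the components f_i actually touches
            x[nz] -= alpha * g[nz]           # unsynchronized write-back

    threads = [threading.Thread(target=worker, args=(sweeps * m // n_threads,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```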

Hogwild! Convergence

Given $\epsilon \in (0, a_0 L)$, and setting
\[ \alpha_k \equiv \frac{\epsilon\mu}{(1 + 2\tau\bar\rho) L M^2 m}, \]
we have, for
\[ k \ge \frac{(1 + 2\tau\bar\rho) L M^2 m}{\mu^2 \epsilon} \log\left(\frac{2 L a_0}{\epsilon}\right) - 1, \]
that
\[ \min_{0 \le j \le k} E\big(f(x_j) - f(x^*)\big) \le \epsilon. \]
Recovers the sublinear $1/k$ convergence rate seen in regular SGD, with the delay $\tau$ and overlap measure $\bar\rho$ both appearing linearly.
(Niu et al., 2011; Richtárik, 2012)

Hogwild! Performance

Hogwild! compared with averaged gradient (AIG) and round-robin (RR). Experiments run on a 12-core machine in 2011 (10 cores used for gradient evaluations, 2 cores for data shuffling).

Hogwild! Performance

[Figure.]

Asynchronous Stochastic Coordinate Descent (ASCD)

Consider $\min f(x)$, where $f : \mathbb{R}^n \to \mathbb{R}$ is smooth and convex.

Each processor (independently, without synchronization) does:
1. Choose $i \in \{1, 2, \dots, n\}$ uniformly at random;
2. Read $x$ and evaluate $g = [\nabla f(x)]_i$;
3. Update $x_i \leftarrow x_i - \frac{\gamma}{L_{\max}}\, g$.

Here $\gamma$ is a steplength (more below) and $L_{\max}$ is a bound on the diagonals of the Hessian $\nabla^2 f(x)$.

Assume that not more than $\tau$ cycles pass between when $x$ is read (step 2) and updated (step 3).

How to choose $\gamma$ to achieve good convergence?

Constants and "Diagonalicity"

Several constants are critical to the analysis:
- $\tau$: maximum delay;
- $L_{\max}$: maximum diagonal of $\nabla^2 f(x)$;
- $L_{res}$: maximum row norm of the Hessian;
- $\mu$: lower bound on the eigenvalues of $\nabla^2 f(x)$ (assumed positive).

The ratio $L_{res}/L_{\max}$ is particularly important — it measures the degree of diagonal dominance in the Hessian $\nabla^2 f(x)$ ("diagonalicity").

By convexity, we have
\[ 1 \le \frac{L_{res}}{L_{\max}} \le \sqrt{n}. \]
Closer to 1 if the Hessian is nearly diagonally dominant (eigenvectors close to the principal coordinate axes). Smaller is better for parallelism.

Diagonalicity Illustrated

[Figures.] The left figure is better: it can tolerate a higher delay parameter $\tau$ and thus more cores working asynchronously.

How to choose $\gamma$?

Choose some $\rho > 1$ and pick $\gamma$ small enough to ensure that
\[ \rho^{-1} \le \frac{E(\|\nabla f(x_{j+1})\|^2)}{E(\|\nabla f(x_j)\|^2)} \le \rho. \]
Not too much change in the gradient over each iteration, so not too much price to pay for using old information in the asynchronous setting.

Choose $\gamma$ small enough to satisfy this property but large enough to get a linear rate.

Assuming that
\[ \tau + 1 \le \frac{\sqrt{n}\,L_{\max}}{2e\,L_{res}}, \]
and choosing
\[ \rho = 1 + \frac{2e\,L_{res}}{\sqrt{n}\,L_{\max}}, \]
we can take $\gamma = 1$. Then have
\[ E(f(x_j) - f^*) \le \left(1 - \frac{\mu}{2 n L_{\max}}\right)^j (f(x_0) - f^*). \]
(Liu and Wright, 2013)

ASCD Discussion

- Linear rate is close to the rate attained by short-step steepest descent.
- The bound on $\tau$ is a measure of potential parallelization. When the ratio $L_{res}/L_{\max}$ is favorable, get $\tau = O(\sqrt{n})$. Thus, expect near-linear speedup on up to $O(\sqrt{n})$ cores running asynchronously in parallel.

Can extend the algorithm and analysis to
- "essentially strongly convex" and "weakly convex" cases;
- separable constraints: have $\tau = O(n^{1/4})$ — less potential parallelism.

Implemented on 4-socket, 40-core Intel Xeon

\[ \min_x \|Ax - b\|^2 + 0.5\|x\|^2, \]
where $A \in \mathbb{R}^{m \times n}$ is a Gaussian random matrix ($m = 6000$, $n = 20000$, columns normalized to 1). $L_{res}/L_{\max} \approx 2.2$. Choose $\gamma = 1$.

[Figure: left, residual (log scale) vs. # epochs for 1, 2, 4, 8, 16, and 32 threads; right, observed vs. ideal speedup as a function of the number of threads (up to 40).]

(Thanks: C. Ré, V. Bittorf, S. Sridhar)
