European Journal of Operational Research: Characterization of the Equivalence of Robustification and Regularization in Linear and Matrix Regression

Transcription

European Journal of Operational Research 270 (2018) 931–942

Characterization of the equivalence of robustification and regularization in linear and matrix regression

Dimitris Bertsimas a,*, Martin S. Copenhaver b
a Sloan School of Management and Operations Research Center, MIT, United States
b Operations Research Center, MIT, United States

Article history: Received 9 November 2016; Accepted 20 March 2017; Available online 28 March 2017.
Keywords: Convex programming; Robust optimization; Statistical regression; Penalty methods; Adversarial learning.

Abstract

The notion of developing statistical methods in machine learning which are robust to adversarial perturbations in the underlying data has been the subject of increasing interest in recent years. A common feature of this work is that the adversarial robustification often corresponds exactly to regularization methods which appear as a loss function plus a penalty. In this paper we deepen and extend the understanding of the connection between robustification and regularization (as achieved by penalization) in regression problems. Specifically,

(a) In the context of linear regression, we characterize precisely under which conditions on the model of uncertainty used and on the loss function penalties robustification and regularization are equivalent.
(b) We extend the characterization of robustification and regularization to matrix regression problems (matrix completion and Principal Component Analysis).

© 2017 Elsevier B.V. All rights reserved.

Copenhaver is partially supported by the Department of Defense, Office of Naval Research, through the National Defense Science and Engineering Graduate Fellowship. * Corresponding author. E-mail addresses: dbertsim@mit.edu (D. Bertsimas), mcopen@mit.edu (M.S. Copenhaver).

1. Introduction

The development of predictive methods that perform well in the face of uncertainty is at the core of modern machine learning and statistical practice. Indeed, the notion of regularization (loosely speaking, a means of controlling the ability of a statistical model to generalize to new settings by trading off with the model's complexity) is at the very heart of such work (Hastie, Tibshirani, & Friedman, 2009). Corresponding regularized statistical methods, such as the Lasso for linear regression (Tibshirani, 1996) and nuclear-norm-based approaches to matrix completion (Candès & Recht, 2012; Recht, Fazel, & Parrilo, 2010), are now ubiquitous and have seen widespread success in practice.

In parallel to the development of such regularization methods, it has been shown in the field of robust optimization that under certain conditions these regularized problems result from the need to immunize the statistical problem against adversarial perturbations in the data (Ben-Tal, Ghaoui, & Nemirovski, 2009; Caramanis, Mannor, & Xu, 2011; Ghaoui & Lebret, 1997; Xu, Caramanis, & Mannor, 2010). Such a robustification offers a different perspective on regularization methods by identifying which adversarial perturbations the model is protected against. Conversely, this can help to inform statistical modeling decisions by identifying potential choices of regularizers.
Further, this connection between regularization and robustification offers the potential to use sophisticated data-driven methods in robust optimization (Bertsimas, Gupta, & Kallus, 2013; Tulabandhula & Rudin, 2014) to design regularizers in a principled fashion.

With the continuing growth of the adversarial viewpoint in machine learning (e.g., the advent of new deep learning methodologies such as generative adversarial networks (Goodfellow et al., 2014a; Goodfellow, Shlens, & Szegedy, 2014b; Shaham, Yamada, & Negahban, 2015)), it is becoming increasingly important to better understand the connection between robustification and regularization. Our goal in this paper is to shed new light on this relationship by focusing in particular on linear and matrix regression problems. Specifically, our contributions include:

1. In the context of linear regression we demonstrate that in general such a robustification procedure is not equivalent to regularization (via penalization). We characterize precisely under which conditions on the model of uncertainty used and on the loss function penalties one has that robustification is equivalent to regularization.
2. We break new ground by considering problems in the matrix setting, such as matrix completion and Principal Component Analysis (PCA). We show that the nuclear norm, a popular penalty function used throughout this setting, arises directly through robustification. As with the case of vector regression, we characterize under which conditions on the model of uncertainty there is equivalence of robustification and regularization in the matrix setting.

The structure of the paper is as follows. In Section 2, we review background on norms and consider robustification and regularization in the context of linear regression, focusing both on their equivalence and non-equivalence. In Section 3, we turn our attention to regression with underlying matrix variables, considering in depth both matrix completion and PCA. In Section 4, we include some concluding remarks.

2. A robust perspective of linear regression

2.1. Norms and their duals

In this section, we introduce the necessary background on norms which we will use to address the equivalence of robustification and regularization in the context of linear regression. Given a vector space $V \subseteq \mathbb{R}^n$ we say that $\|\cdot\| : V \to \mathbb{R}$ is a norm if for all $v, w \in V$ and $\alpha \in \mathbb{R}$:

1. If $\|v\| = 0$, then $v = 0$,
2. $\|\alpha v\| = |\alpha| \|v\|$ (absolute homogeneity), and
3. $\|v + w\| \le \|v\| + \|w\|$ (triangle inequality).

If $\|\cdot\|$ satisfies conditions 2 and 3, but not 1, we call it a seminorm. For a norm $\|\cdot\|$ on $\mathbb{R}^n$ we define its dual, denoted $\|\cdot\|^*$, to be
$$\|x\|^* := \max_{\beta \in \mathbb{R}^n : \|\beta\| \le 1} x'\beta,$$
where $x'$ denotes the transpose of $x$ (and therefore $x'\beta$ is the usual inner product). For example, the $\ell_p$ norms $\|\beta\|_p := (\sum_i |\beta_i|^p)^{1/p}$ for $p \in [1,\infty)$ and $\|\beta\|_\infty := \max_i |\beta_i|$ satisfy a well-known duality relation: $\ell_p$ is dual to $\ell_{p^*}$, where $p^* \in [1,\infty]$ with $1/p^* + 1/p = 1$. We call $p^*$ the conjugate of $p$. More generally, for matrix norms¹ $\|\cdot\|$ on $\mathbb{R}^{m \times n}$ the dual is defined analogously:
$$\|\Delta\|^* := \max_{A \in \mathbb{R}^{m \times n} : \|A\| \le 1} \langle \Delta, A \rangle,$$
where $\Delta \in \mathbb{R}^{m \times n}$ and $\langle \cdot,\cdot \rangle$ denotes the trace inner product: $\langle \Delta, A \rangle = \mathrm{Tr}(\Delta' A)$, where $\Delta'$ denotes the transpose of $\Delta$. We note that the dual of the dual norm is the original norm (Boyd & Vandenberghe, 2004).

Three widely used choices for matrix norms (see Horn & Johnson, 2013) are Frobenius, spectral, and induced norms. The definitions of these norms are given below for $\Delta \in \mathbb{R}^{m \times n}$ and summarized in Table 1 for convenient reference.

1. The p-Frobenius norm, denoted $\|\cdot\|_{F_p}$, is the entrywise $\ell_p$ norm on the entries of $\Delta$:
   $$\|\Delta\|_{F_p} := \Big( \sum_{ij} |\Delta_{ij}|^p \Big)^{1/p}.$$
   Analogous to before, $F_{p^*}$ is dual to $F_p$, where $1/p^* + 1/p = 1$.
2. The p-spectral (Schatten) norm, denoted $\|\cdot\|_{\sigma_p}$, is the $\ell_p$ norm on the singular values of the matrix $\Delta$:
   $$\|\Delta\|_{\sigma_p} := \|\mu(\Delta)\|_p,$$
   where $\mu(\Delta)$ denotes the vector containing the singular values of $\Delta$. Again, $\sigma_{p^*}$ is dual to $\sigma_p$.
3. Finally we consider the class of induced norms. If $g : \mathbb{R}^m \to \mathbb{R}$ and $h : \mathbb{R}^n \to \mathbb{R}$ are norms, then we define the induced norm $\|\cdot\|_{(h,g)}$ as
   $$\|\Delta\|_{(h,g)} := \max_{\beta \in \mathbb{R}^n,\ \beta \ne 0} \frac{g(\Delta\beta)}{h(\beta)}.$$
   An important special case occurs when $g = \ell_p$ and $h = \ell_q$. When such norms are used, $(q, p)$ is used as shorthand to denote $(\ell_q, \ell_p)$. Induced norms are sometimes referred to as operator norms. We reserve the term operator norm for the induced norm $\|\cdot\|_{(\ell_2,\ell_2)} = \|\cdot\|_{(2,2)} = \|\cdot\|_{\sigma_\infty}$, which measures the largest singular value.

Table 1. Matrix norms on $\mathbb{R}^{m \times n}$.
  p-Frobenius norm:            $\|\Delta\|_{F_p} = (\sum_{ij} |\Delta_{ij}|^p)^{1/p}$            (entrywise $\ell_p$ norm on the entries of $\Delta$)
  p-spectral (Schatten) norm:  $\|\Delta\|_{\sigma_p} = \|\mu(\Delta)\|_p$                        ($\ell_p$ norm on the singular values of $\Delta$)
  Induced norm:                $\|\Delta\|_{(h,g)} = \max_{\beta \ne 0} g(\Delta\beta)/h(\beta)$  (induced by the norms $g$ and $h$)

¹ We treat a matrix norm as any norm on $\mathbb{R}^{m \times n}$ which satisfies the three conditions of a usual vector norm, although some authors reserve the term "matrix norm" for a norm on $\mathbb{R}^{m \times n}$ which also satisfies a submultiplicativity condition (see Horn and Johnson, 2013, pg. 341).
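The following is a small numerical sketch (ours, not part of the paper) of the three norm families in Table 1, assuming NumPy; the helper names and test matrix are illustrative. It also checks the identity $\|\cdot\|_{(2,2)} = \|\cdot\|_{\sigma_\infty}$ by evaluating the induced-norm ratio at the top right singular vector.

```python
# Minimal sketch (not from the paper) of the norms in Table 1, assuming NumPy.
import numpy as np

def frobenius_p(D, p):
    """p-Frobenius norm: entrywise l_p norm of the matrix."""
    return np.linalg.norm(D.ravel(), ord=p)

def schatten_p(D, p):
    """p-spectral (Schatten) norm: l_p norm of the singular values."""
    return np.linalg.norm(np.linalg.svd(D, compute_uv=False), ord=p)

rng = np.random.default_rng(1)
D = rng.standard_normal((5, 4))

print(frobenius_p(D, 2), schatten_p(D, 2))   # F_2 and sigma_2 coincide (usual Frobenius norm)
print(frobenius_p(D, np.inf))                # largest entry in absolute value
print(schatten_p(D, 1))                      # nuclear norm (sum of singular values)

# The (2,2)-induced norm is the operator norm sigma_inf: the ratio
# ||D b||_2 / ||b||_2 is maximized at the top right singular vector.
_, s, Vt = np.linalg.svd(D)
beta_top = Vt[0]
print(s[0], np.linalg.norm(D @ beta_top) / np.linalg.norm(beta_top))
```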
2.2. Uncertain regression

We now turn our attention to uncertain linear regression problems and regularization. The starting point for our discussion is the standard problem
$$\min_{\beta \in \mathbb{R}^n} g(y - X\beta),$$
where $y \in \mathbb{R}^m$ and $X \in \mathbb{R}^{m \times n}$ are data and $g$ is some convex function, typically a norm. For example, $g = \|\cdot\|_2$ is least squares, while $g = \|\cdot\|_1$ is known as least absolute deviation (LAD). In favor of models which mitigate the effects of overfitting, these are often replaced by the regularization problem
$$\min_\beta g(y - X\beta) + h(\beta),$$
where $h : \mathbb{R}^n \to \mathbb{R}$ is some penalty function, typically taken to be convex. This approach often aims to address overfitting by penalizing the complexity of the model, measured as $h(\beta)$. (For a more formal treatment using Hilbert space theory, see Bauschke & Combettes, 2011; Bousquet, Boucheron, & Lugosi, 2004.) For example, taking $g = \|\cdot\|_2^2$ and $h = \|\cdot\|_2^2$, we recover the so-called regularized least squares (RLS), also known as ridge regression (Hastie et al., 2009). The choice of $g = \|\cdot\|_2^2$ and $h = \|\cdot\|_1$ leads to Lasso, or least absolute shrinkage and selection operator, introduced in Tibshirani (1996). Lasso is often employed in scenarios where the solution $\beta$ is desired to be sparse, i.e., $\beta$ has very few nonzero entries. Broadly speaking, regularization can take much more general forms; for our purposes, we restrict our attention to regularization that appears in the penalized form above.
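Before turning to the robustification problem below, here is a minimal numerical sketch (ours, not from the paper) of the two regularized problems just mentioned: ridge regression via its closed form and the Lasso via a plain proximal-gradient (ISTA) loop. The data, penalty levels, and iteration count are illustrative assumptions.

```python
# Minimal numerical sketch (ours) of ridge and Lasso, assuming NumPy.
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 10
X = rng.standard_normal((m, n))
beta_true = np.zeros(n)
beta_true[:3] = [2.0, -1.0, 0.5]                      # sparse ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(m)

# Ridge regression (RLS): min_b ||y - Xb||_2^2 + alpha ||b||_2^2 has a closed form.
alpha = 1.0
beta_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

# Lasso: min_b 0.5 ||y - Xb||_2^2 + lam ||b||_1, solved by proximal gradient (ISTA).
lam = 2.0
step = 1.0 / np.linalg.norm(X, 2) ** 2                # 1/L with L = sigma_max(X)^2
beta_lasso = np.zeros(n)
for _ in range(5000):
    z = beta_lasso - step * (X.T @ (X @ beta_lasso - y))
    beta_lasso = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding

print("ridge:", np.round(beta_ridge, 2))
print("lasso:", np.round(beta_lasso, 2))              # many coefficients exactly zero
```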

In contrast to this approach, one may alternatively wish to reexamine the nominal regression problem $\min_\beta g(y - X\beta)$ and instead attempt to solve it taking into account adversarial noise in the data matrix $X$. As in Ghaoui and Lebret (1997), Lewis (2002), Lewis and Pang (2009), Ben-Tal et al. (2009), Xu et al. (2010), this approach may take the form
$$\min_\beta \max_{\Delta \in \mathcal{U}} g(y - (X + \Delta)\beta), \qquad (1)$$
where the set $\mathcal{U} \subseteq \mathbb{R}^{m \times n}$ characterizes the user's belief about uncertainty on the data matrix $X$. This set $\mathcal{U}$ is known in the language of robust optimization (Ben-Tal et al., 2009; Bertsimas, Brown, & Caramanis, 2011) as an uncertainty set, and the inner maximization problem $\max_{\Delta \in \mathcal{U}} g(y - (X+\Delta)\beta)$ takes into account the worst-case error (measured via $g$) over $\mathcal{U}$. We call such a procedure robustification because it attempts to immunize or robustify the regression problem from structural uncertainty in the data. Such an adversarial or "worst-case" procedure is one of the key tenets of the area of robust optimization (Ben-Tal et al., 2009; Bertsimas et al., 2011).

As noted in the introduction, the adversarial perspective offers several attractive features. Let us first focus on settings when robustification coincides with a regularization problem. In such a case, the robustification identifies the adversarial perturbations the model is protected against, which can in turn provide additional insight into the behavior of different regularizers. Further, technical machinery developed for the construction of data-driven uncertainty sets in robust optimization (Bertsimas et al., 2013; Tulabandhula & Rudin, 2014) enables the potential for a principled framework for the design of regularization schemes, in turn addressing a complex modeling decision encountered in practice.

Moreover, the adversarial approach is of interest in its own right, even if robustification does not correspond directly to a regularization problem. This is evidenced in part by the burgeoning success of generative adversarial networks and other methodologies in deep learning (Goodfellow et al., 2014a; Goodfellow et al., 2014b; Shaham et al., 2015). Further, the worst-case approach often leads to a more straightforward analysis of properties of estimators (Xu et al., 2010) as well as algorithms for finding estimators (Ben-Tal, Hazan, Koren, & Mannor, 2015).

Let us now return to the robustification problem. A natural choice of an uncertainty set which gives rise to interpretability is the set $\mathcal{U} = \{\Delta \in \mathbb{R}^{m \times n} : \|\Delta\| \le \lambda\}$, where $\|\cdot\|$ is some matrix norm and $\lambda \ge 0$. One can then write $\max_{\Delta \in \mathcal{U}} g(y - (X+\Delta)\beta)$ as
$$\max_{\tilde{X}} \ g(y - \tilde{X}\beta) \quad \text{s.t.} \quad \|\tilde{X} - X\| \le \lambda,$$
or the worst-case error taken over all $\tilde{X}$ sufficiently close to the data matrix $X$. In what follows, if $\|\cdot\|$ is a norm or seminorm, then we let $\mathcal{U}_{\|\cdot\|}$ denote the ball of radius $\lambda$ in $\|\cdot\|$:
$$\mathcal{U}_{\|\cdot\|} := \{\Delta : \|\Delta\| \le \lambda\}.$$
For example, $\mathcal{U}_{F_p}$, $\mathcal{U}_{\sigma_p}$, and $\mathcal{U}_{(h,g)}$ denote uncertainty sets under the norms $F_p$, $\sigma_p$, and $(h,g)$, respectively. We assume $\lambda > 0$ fixed for the remainder of the paper.

We briefly mention addressing uncertainty in $y$. Suppose that we have a set $\mathcal{V} \subseteq \mathbb{R}^m$ which captures some belief about the uncertainty in $y$. If again we have an uncertainty set $\mathcal{U} \subseteq \mathbb{R}^{m \times n}$, we may attempt to solve a problem of the form
$$\min_\beta \max_{\delta \in \mathcal{V},\ \Delta \in \mathcal{U}} g(y + \delta - (X+\Delta)\beta).$$
We can instead work with a new loss function $\bar{g}$ defined as
$$\bar{g}(v) := \max_{\delta \in \mathcal{V}} g(v + \delta).$$
If $g$ is convex, then so is $\bar{g}$. In this way, we can work with the problem in the form
$$\min_\beta \max_{\Delta \in \mathcal{U}} \bar{g}(y - (X+\Delta)\beta),$$
where there is only uncertainty in $X$. Throughout the remainder of this paper we will only consider such uncertainty.
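As a small numerical check (ours, not from the paper) of this reduction, consider $g = \|\cdot\|_2$ and $\mathcal{V} = \{\delta : \|\delta\|_2 \le \rho\}$, for which $\bar{g}(v) = \|v\|_2 + \rho$, attained at $\delta^* = \rho\, v / \|v\|_2$; the value of $\rho$ and the sampled vectors below are illustrative.

```python
# Numerical check (ours) of gbar for g = ||.||_2 and V = {delta : ||delta||_2 <= rho}.
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(6)
rho = 0.3

# Worst-case delta pushes v radially outward: delta* = rho * v / ||v||_2.
delta_star = rho * v / np.linalg.norm(v)
gbar_v = np.linalg.norm(v + delta_star)
print(gbar_v, np.linalg.norm(v) + rho)                # identical

# No feasible delta does better.
deltas = rng.standard_normal((10_000, v.size))
deltas *= rho / np.linalg.norm(deltas, axis=1, keepdims=True)
print(np.linalg.norm(v + deltas, axis=1).max() <= gbar_v + 1e-12)   # True
```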
Relation to error-in-variable models

Another class of statistical models which are particularly relevant for the work contained herein are error-in-variable models (Carroll, Ruppert, Stefanski, & Crainiceanu, 2006). One approach to such a problem takes the form
$$\min_{\beta \in \mathbb{R}^n,\ \Delta \in \mathbb{R}^{m \times n}} g(y - (X+\Delta)\beta) + P(\Delta),$$
where $P$ is a penalty function which takes into account the complexity of possible perturbations $\Delta$ to the data matrix $X$. A canonical example of such a method is total least squares (Golub & Loan, 1980; Markovsky & Huffel, 2007), which can be written for fixed $\tau > 0$ as
$$\min_{\beta, \Delta} \|y - (X+\Delta)\beta\|_2^2 + \tau \|\Delta\|_F^2.$$
An equivalent way of writing such problems is, instead of penalized form, as constrained optimization problems. In particular, the constrained version generically takes the form
$$\min_\beta\ \min_{\Delta :\ P(\Delta) \le \eta} g(y - (X+\Delta)\beta), \qquad (2)$$
where $\eta \ge 0$ is fixed. Under the representation in (2), the comparison with the robust optimization approach in (1) becomes immediate. While the classical error-in-variables approach takes an optimistic view on uncertainty in the data matrix $X$, and finds loadings $\beta$ on the new "corrected" data matrix $X + \Delta$, the minimax approach of (1) considers protections against adversarial perturbations in the data which maximally increase the loss.

One of the advantages of the adversarial approach to error-in-variables is that it enables a direct analysis of certain statistical properties, such as asymptotic consistency of estimators (c.f. Caramanis et al., 2011; Xu et al., 2010). In contrast, analyzing the consistency of estimators attained by a model such as total least squares is a complex issue (Kukush, Markovsky, & Huffel, 2005).

Relation to robust statistics

There has been extensive work in the robust statistics community on statistical methods which perform well in noisy, real-world environments. As noted in Ben-Tal et al. (2009), the connection between robust optimization and robust statistics is not clear. We do not put forth any connection here, but briefly describe the development of robust statistics to appropriately contextualize our work. Instead of modeling noise via a distributional perspective, as is often the case in robust statistics, in this paper we choose to model it in a deterministic way using uncertainty sets. For a comprehensive description of the theoretical developments in robust statistics in the last half century, see the texts (Huber & Ronchetti, 2009; Rousseeuw, 1984) and the surveys (Hubert, Rousseeuw, & Aelst, 2008; Morgenthaler, 2007).

A central aspect of work in robust statistics is the development and use of a more general set of loss functions. (This is in contrast to the robust optimization approach, which generally results in the same nominal loss function with a new penalty; see Section 2.3 below.) For example, while least squares (the $\ell_2$ loss) is known to perform well under Gaussian noise, it does not perform well under other types of noise, such as contaminated Gaussian noise. (Indeed, the Gaussian distribution was defined so that least squares is the optimal method under Gaussian noise (Rousseeuw, 1984).) In contrast, a method like LAD regression (the $\ell_1$ loss) generally performs better than least squares with errors in $y$, but not necessarily errors in the data matrix $X$.

A more general class of such methods is M-estimators, as proposed in Huber (1973) and since studied extensively (Huber & Ronchetti, 2009; Rousseeuw & Leroy, 1987). However, M-estimators lack desirable finite-sample breakdown properties; in short, M-estimators perform very poorly in recovering the loadings $\beta$ under gross errors in the data $(X, y)$. To address some of these shortcomings, GM-estimators were introduced (Hampel, 1974; Hill, 1977; Mallows, 1975). Since these, many other estimators have been proposed. One such method is least quantile of squares regression (Rousseeuw, 1984), which has highly desirable robustness properties. There has been significant interest in new robust statistical methods in recent years with the increasing availability of large quantities of high-dimensional data, which often make reliable outlier detection difficult. For commentary on modern approaches to robust statistics, see (Bradic, Fan, & Wang, 2011; Fan, Fan, & Barut, 2014; Hubert et al., 2008) and references therein.
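For concreteness, here is a sketch (ours, under illustrative data assumptions) of the classical SVD-based total least squares estimator (Golub & Loan, 1980), which is closely related to the penalized error-in-variables formulation above, compared against ordinary least squares when the data matrix itself is noisy.

```python
# Sketch (ours) of classical SVD-based total least squares vs. ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
beta_true = np.array([1.5, -2.0, 0.5])
X_clean = rng.standard_normal((m, n))
X = X_clean + 0.1 * rng.standard_normal((m, n))       # noise in the data matrix X
y = X_clean @ beta_true + 0.1 * rng.standard_normal(m)

# Ordinary least squares ignores the errors in X.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical TLS: take the right singular vector of [X | y] corresponding to the
# smallest singular value and rescale.
_, _, Vt = np.linalg.svd(np.column_stack([X, y]))
v = Vt[-1]
beta_tls = -v[:n] / v[n]

print("true:", beta_true)
print("ols :", np.round(beta_ols, 3))
print("tls :", np.round(beta_tls, 3))
```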

2.3. Equivalence of robustification and regularization

A natural question is when do the procedures of regularization and robustification coincide. This problem was first studied in Ghaoui and Lebret (1997) in the context of uncertain least squares problems and has been extended to more general settings in Caramanis et al. (2011); Xu et al. (2010) and most comprehensively in Ben-Tal et al. (2009). In this section, we present settings in which robustification is equivalent to regularization. When such an equivalence holds, tools from robust optimization can be used to analyze properties of the regularization problem (c.f. Caramanis et al., 2011; Xu et al., 2010).

We begin with a general result on robustification under induced seminorm uncertainty sets.

Theorem 1. If $g : \mathbb{R}^m \to \mathbb{R}$ is a seminorm which is not identically zero and $h : \mathbb{R}^n \to \mathbb{R}$ is a norm, then for any $z \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$,
$$\max_{\Delta \in \mathcal{U}_{(h,g)}} g(z + \Delta\beta) = g(z) + \lambda h(\beta),$$
where $\mathcal{U}_{(h,g)} = \{\Delta : \|\Delta\|_{(h,g)} \le \lambda\}$.

Proof. From the triangle inequality, $g(z + \Delta\beta) \le g(z) + g(\Delta\beta) \le g(z) + \lambda h(\beta)$ for any $\Delta \in \mathcal{U} := \mathcal{U}_{(h,g)}$. We next show that there exists some $\Delta \in \mathcal{U}$ so that $g(z + \Delta\beta) = g(z) + \lambda h(\beta)$. Let $v \in \mathbb{R}^n$ so that $v \in \arg\max_{h^*(v) \le 1} v'\beta$, where $h^*$ is the dual norm of $h$. Note in particular that $v'\beta = h(\beta)$ by the definition of the dual norm $h^*$. For now suppose that $g(z) \ne 0$. Define the rank-one matrix $\Delta = \frac{\lambda}{g(z)} z v'$. Observe that
$$g(z + \Delta\beta) = g\Big(z + \frac{\lambda h(\beta)}{g(z)} z\Big) = \Big(1 + \frac{\lambda h(\beta)}{g(z)}\Big) g(z) = g(z) + \lambda h(\beta).$$
We next show that $\Delta \in \mathcal{U}$. Observe for any $x \in \mathbb{R}^n$ that
$$g(\Delta x) = g\Big(\frac{\lambda v'x}{g(z)} z\Big) = \lambda |v'x| \le \lambda h(x)\, h^*(v) \le \lambda h(x),$$
where the final inequality follows by definition of the dual norm. Hence $\Delta \in \mathcal{U}$, as desired.

We now consider the case when $g(z) = 0$. Let $u \in \mathbb{R}^m$ so that $g(u) = 1$ (because $g$ is not identically zero there exists some $u$ so that $g(u) \ne 0$, and so by homogeneity of $g$ we can take $u$ so that $g(u) = 1$). Let $v$ be as before. Now define $\Delta = \lambda u v'$. We observe that
$$g(z + \Delta\beta) = g(z + \lambda u v'\beta) \le g(z) + \lambda |v'\beta|\, g(u) = \lambda h(\beta).$$
Now, by the reverse triangle inequality,
$$g(z + \Delta\beta) \ge g(\Delta\beta) - g(z) = g(\Delta\beta) = \lambda h(\beta),$$
and therefore $g(z + \Delta\beta) = \lambda h(\beta) = g(z) + \lambda h(\beta)$. The proof that $\Delta \in \mathcal{U}$ is identical to the case when $g(z) \ne 0$. This completes the proof.

This result implies as a corollary known results on the connection between robustification and regularization as found in Xu et al. (2010), Ben-Tal et al. (2009), Caramanis et al. (2011) and references therein.

Corollary 1 (Ben-Tal et al., 2009; Caramanis et al., 2011; Xu et al., 2010). If $p, q \in [1,\infty]$ then
$$\min_\beta \max_{\Delta \in \mathcal{U}_{(q,p)}} \|y - (X+\Delta)\beta\|_p = \min_\beta \|y - X\beta\|_p + \lambda \|\beta\|_q.$$
In particular, for $p = q = 2$ we recover regularized least squares as a robustification; likewise, for $p = 2$ and $q = 1$ we recover the Lasso.²

² Strictly speaking, we recover equivalent problems to regularized least squares and Lasso, respectively. We take the usual convention and overlook this technicality (see Ben-Tal et al., 2009 for a discussion). For completeness, we note that one can work directly with the true $\ell_2^2$ loss function, although at the cost of requiring more complicated uncertainty sets to recover equivalence results.

Theorem 2 (Ben-Tal et al., 2009; Caramanis et al., 2011; Xu et al., 2010). One has the following for any $p, q \in [1,\infty]$:
$$\min_\beta \max_{\Delta \in \mathcal{U}_{F_p}} \|y - (X+\Delta)\beta\|_p = \min_\beta \|y - X\beta\|_p + \lambda \|\beta\|_{p^*},$$
where $p^*$ is the conjugate of $p$. Similarly,
$$\min_\beta \max_{\Delta \in \mathcal{U}_{\sigma_q}} \|y - (X+\Delta)\beta\|_2 = \min_\beta \|y - X\beta\|_2 + \lambda \|\beta\|_2.$$

Observe that regularized least squares arises again under all uncertainty sets defined by the spectral norms $\sigma_q$ when the loss function is $g = \|\cdot\|_2$.
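The construction in the proof of Theorem 1 is easy to verify numerically. The following sketch (ours, with illustrative data) takes the Lasso case of Corollary 1, $g = \|\cdot\|_2$ and $h = \|\cdot\|_1$, for which the dual-norm maximizer is $v = \mathrm{sign}(\beta)$ and the worst-case perturbation is the rank-one matrix $\Delta^* = (\lambda / \|z\|_2)\, z\, v'$.

```python
# Numerical sketch (ours) of the worst-case construction in Theorem 1's proof
# for g = ||.||_2, h = ||.||_1 (the Lasso case of Corollary 1).
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 7, 5, 0.4
z = rng.standard_normal(m)
beta = rng.standard_normal(n)

# Dual-norm maximizer of v'beta over the l_inf ball (dual of l_1) is sign(beta).
v = np.sign(beta)
Delta_star = (lam / np.linalg.norm(z)) * np.outer(z, v)

# Delta* lies in U_{(1,2)}: its (1,2)-induced norm is its largest column l_2 norm.
print(np.linalg.norm(Delta_star, axis=0).max() <= lam + 1e-12)       # True

# It attains g(z) + lam * h(beta) exactly.
print(np.linalg.norm(z + Delta_star @ beta),
      np.linalg.norm(z) + lam * np.linalg.norm(beta, 1))             # equal

# Other feasible perturbations (columns scaled to l_2 norm lam) never exceed it.
D = rng.standard_normal((m, n))
D *= lam / np.linalg.norm(D, axis=0, keepdims=True)
print(np.linalg.norm(z + D @ beta)
      <= np.linalg.norm(z) + lam * np.linalg.norm(beta, 1) + 1e-12)  # True
```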
Now we continue with a remark on how Lasso arises through robustification. See Xu et al. (2010) for comprehensive work on the robustness and sparsity implications of Lasso as interpreted through the robustification considered in this paper.

Remark 1. As per Corollary 1 it is known that Lasso arises as uncertain $\ell_2$ regression with uncertainty set $\mathcal{U} := \mathcal{U}_{(1,2)}$ (Xu et al., 2010). As with Theorem 1, one might argue that the $\ell_1$ penalizer arises as an artifact of the model of uncertainty. We remark that one can derive the set $\mathcal{U}$ as an induced uncertainty set defined using the "true" non-convex penalty $\ell_0$, where $\|\beta\|_0 := |\{i : \beta_i \ne 0\}|$. To be precise, for any $p \in [1,\infty]$ we claim that
$$\tilde{\mathcal{U}} := \{\Delta : \|\Delta\beta\|_2 \le \lambda \|\beta\|_0 \ \text{ for all } \beta \in \mathbb{R}^n \text{ with } \|\beta\|_p \le 1\}$$
satisfies $\tilde{\mathcal{U}} = \mathcal{U}$. This is summarized, with an additional representation $\hat{\mathcal{U}}$ as used in Xu et al. (2010), in the following proposition.

Proposition 1. If $\mathcal{U} = \mathcal{U}_{(1,2)}$, $\tilde{\mathcal{U}} = \{\Delta : \|\Delta\beta\|_2 \le \lambda \|\beta\|_0 \ \forall \beta \text{ with } \|\beta\|_p \le 1\}$ for an arbitrary $p \in [1,\infty]$, and $\hat{\mathcal{U}} = \{\Delta : \|\Delta_i\|_2 \le \lambda \ \forall i\}$, where $\Delta_i$ is the $i$th column of $\Delta$, then $\mathcal{U} = \tilde{\mathcal{U}} = \hat{\mathcal{U}}$.

Proof. We first show that $\mathcal{U} \subseteq \tilde{\mathcal{U}}$. Because $\|\beta\|_1 \le \|\beta\|_0$ for all $\beta \in \mathbb{R}^n$ with $\|\beta\|_p \le 1$, we have that $\mathcal{U} \subseteq \tilde{\mathcal{U}}$. Now suppose that $\Delta \in \tilde{\mathcal{U}}$. Then for any $\beta \in \mathbb{R}^n$, we have that
$$\|\Delta\beta\|_2 = \Big\|\Delta \sum_i \beta_i e_i\Big\|_2 \le \sum_i |\beta_i|\, \|\Delta e_i\|_2 \le \sum_i |\beta_i|\, \lambda = \lambda \|\beta\|_1,$$
where $\{e_i\}_{i=1}^n$ is the standard orthonormal basis for $\mathbb{R}^n$. Hence $\Delta \in \mathcal{U}$ and therefore $\tilde{\mathcal{U}} \subseteq \mathcal{U}$. Combining with the previous direction gives $\mathcal{U} = \tilde{\mathcal{U}}$.

We now prove that $\mathcal{U} = \hat{\mathcal{U}}$. That $\hat{\mathcal{U}} \subseteq \mathcal{U}$ is essentially obvious (one argues exactly as above); $\mathcal{U} \subseteq \hat{\mathcal{U}}$ follows by considering $\beta \in \{e_i\}_{i=1}^n$. This completes the proof.

This proposition implies that $\ell_1$ arises from the robustification setting without directly appealing to standard convexity arguments for why $\ell_1$ should be used to replace $\ell_0$ (which use the fact that $\ell_1$ is the so-called convex envelope of $\ell_0$ on $[-1,1]^n$; see e.g. Boyd and Vandenberghe (2004)).
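A small numerical sketch (ours, with illustrative values $\lambda = 0.5$ and $p = \infty$) of the set equalities in Proposition 1: a matrix whose columns are bounded in $\ell_2$ norm by $\lambda$ (i.e., a member of $\hat{\mathcal{U}}$) also satisfies the defining conditions of $\mathcal{U}_{(1,2)}$ and of the $\ell_0$-based set $\tilde{\mathcal{U}}$.

```python
# Numerical sketch (ours) of Proposition 1 with lam = 0.5 and p = infinity.
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 6, 4, 0.5

# A matrix in U_hat: every column has l_2 norm at most lam.
D = rng.standard_normal((m, n))
D *= lam / np.linalg.norm(D, axis=0, keepdims=True)

# Membership in U = U_{(1,2)}: since ||D b||_2 is convex in b, the ratio
# ||D b||_2 / ||b||_1 is maximized at a vertex of the l_1 ball, i.e. at +/- e_i.
B = rng.standard_normal((100_000, n))
ratios = np.linalg.norm(B @ D.T, axis=1) / np.linalg.norm(B, ord=1, axis=1)
print(ratios.max() <= lam + 1e-12)                                     # True: D is in U
print(np.isclose(max(np.linalg.norm(D @ e) for e in np.eye(n)), lam))  # attained at basis vectors

# Membership in U_tilde (p = infinity): ||D b||_2 <= lam ||b||_0 whenever ||b||_inf <= 1.
B_box = rng.uniform(-1.0, 1.0, size=(50_000, n))
B_box[rng.random(B_box.shape) < 0.5] = 0.0                             # make ||b||_0 vary
lhs = np.linalg.norm(B_box @ D.T, axis=1)
rhs = lam * np.count_nonzero(B_box, axis=1)
print(bool(np.all(lhs <= rhs + 1e-12)))                                # True
```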

In light of the above discussion, it is not difficult to show that other Lasso-like methods can also be expressed as an adversarial robustification, supporting the flexibility and versatility of such an approach. One such example is the elastic net (De Mol, De Vito, & Rosasco, 2009; Mosci, Rosasco, Santoro, Verri, & Villa, 2010; Zou & Hastie, 2005), a hybridized version of ridge regression and the Lasso. An equivalent representation of the elastic net is as follows:
$$\min_\beta \|y - X\beta\|_2 + \lambda \|\beta\|_1 + \mu \|\beta\|_2.$$
As per Corollary 1 and Theorem 2, this can be written exactly as
$$\min_\beta \ \max_{\Delta \in \hat{\mathcal{U}},\ \|\Gamma\|_{F_2} \le \mu} \|y - (X + \Delta + \Gamma)\beta\|_2,$$
where $\hat{\mathcal{U}} = \{\Delta : \|\Delta_i\|_2 \le \lambda \ \forall i\}$ is the column-wise uncertainty set of Proposition 1. Under this interpretation, we see that $\lambda$ and $\mu$ directly control the tradeoff between two different types of perturbations: "featurewise" perturbations $\Delta$ (controlled via $\lambda$, which bounds the $\ell_2$ norm of each column of $\Delta$) and "global" perturbations $\Gamma$ (controlled via $\mu$ and the $F_2$ norm).

We conclude this section with another example of when robustification is equivalent to regularization, for the case of LAD ($\ell_1$) and maximum absolute deviation ($\ell_\infty$) regression under row-wise uncertainty.

Theorem 3 (Xu et al., 2010). Fix $q \in [1,\infty]$ and let $\mathcal{U} = \{\Delta : \|\delta^i\|_{q^*} \le \lambda \ \forall i\}$, where $\delta^i$ is the $i$th row of $\Delta \in \mathbb{R}^{m \times n}$. Then
$$\min_\beta \max_{\Delta \in \mathcal{U}} \|y - (X+\Delta)\beta\|_1 = \min_\beta \|y - X\beta\|_1 + m\lambda \|\beta\|_q$$
and
$$\min_\beta \max_{\Delta \in \mathcal{U}} \|y - (X+\Delta)\beta\|_\infty = \min_\beta \|y - X\beta\|_\infty + \lambda \|\beta\|_q.$$

For completeness, we note that the uncertainty set $\mathcal{U} = \{\Delta : \|\delta^i\|_{q^*} \le \lambda \ \forall i\}$ considered in Theorem 3 is actually an induced uncertainty set, namely, $\mathcal{U} = \mathcal{U}_{(q,\infty)}$.

2.4. Non-equivalence of robustification and regularization

In contrast to previous work studying robustification for regression, which primarily addresses tractability of solving the new uncertain problem (Ben-Tal et al., 2009) or the implications for Lasso (Xu et al., 2010), we instead focus our attention on characterization of the equivalence between robustification and regularization. We begin with a regularization upper bound on robustification problems.

Proposition 2. Let $\mathcal{U} \subseteq \mathbb{R}^{m \times n}$ be any non-empty, compact set and $g : \mathbb{R}^m \to \mathbb{R}$ a seminorm. Then there exists some seminorm $h : \mathbb{R}^n \to \mathbb{R}$ so that for any $z \in \mathbb{R}^m$, $\beta \in \mathbb{R}^n$,
$$\max_{\Delta \in \mathcal{U}} g(z + \Delta\beta) \le g(z) + h(\beta),$$
with equality when $z = 0$.

Proof. Let $h : \mathbb{R}^n \to \mathbb{R}$ be defined as
$$h(\beta) := \max_{\Delta \in \mathcal{U}} g(\Delta\beta).$$
To show that $h$ is a seminorm we must show it satisfies absolute homogeneity and the triangle inequality. For any $\beta \in \mathbb{R}^n$ and $\alpha \in \mathbb{R}$,
$$h(\alpha\beta) = \max_{\Delta \in \mathcal{U}} g(\Delta(\alpha\beta)) = \max_{\Delta \in \mathcal{U}} |\alpha|\, g(\Delta\beta) = |\alpha| \max_{\Delta \in \mathcal{U}} g(\Delta\beta) = |\alpha|\, h(\beta),$$
so absolute homogeneity is satisfied. Similarly, if $\beta, \gamma \in \mathbb{R}^n$,
$$h(\beta + \gamma) = \max_{\Delta \in \mathcal{U}} g(\Delta(\beta+\gamma)) \le \max_{\Delta \in \mathcal{U}} \big[ g(\Delta\beta) + g(\Delta\gamma) \big] \le \max_{\Delta \in \mathcal{U}} g(\Delta\beta) + \max_{\Delta \in \mathcal{U}} g(\Delta\gamma) = h(\beta) + h(\gamma),$$
and hence the triangle inequality is satisfied. Therefore, $h$ is a seminorm which satisfies the desired properties, completing the proof. (The upper bound itself follows from the triangle inequality for $g$, since $g(z + \Delta\beta) \le g(z) + g(\Delta\beta) \le g(z) + h(\beta)$, and equality at $z = 0$ holds by the definition of $h$.)
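The seminorm $h$ built in the proof of Proposition 2 is easy to experiment with. The following sketch (ours) takes $\mathcal{U}$ to be a small finite, hence compact, collection of randomly generated matrices and $g = \|\cdot\|_2$; the set size and test vectors are illustrative.

```python
# Small numerical sketch (ours) of Proposition 2: h(beta) := max_{Delta in U} g(Delta beta).
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 4
U = [rng.standard_normal((m, n)) for _ in range(8)]    # illustrative finite (compact) set

def g(v):
    return np.linalg.norm(v)                           # the seminorm g (here a norm)

def h(b):
    return max(g(D @ b) for D in U)                    # the seminorm built in the proof

beta, gamma, z = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(m)

print(np.isclose(h(-2.5 * beta), 2.5 * h(beta)))                       # absolute homogeneity
print(h(beta + gamma) <= h(beta) + h(gamma) + 1e-12)                   # triangle inequality
print(max(g(z + D @ beta) for D in U) <= g(z) + h(beta) + 1e-12)       # the upper bound
print(np.isclose(max(g(np.zeros(m) + D @ beta) for D in U), h(beta)))  # equality at z = 0
```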
When equality is attained for all pairs $(z, \beta) \in \mathbb{R}^m \times \mathbb{R}^n$, we are in the regime of the previous section, and we say that robustification under $\mathcal{U}$ is equivalent to regularization under $h$. We now discuss a variety of explicit settings in which regularization only provides upper and lower bounds to the true robustified problem.

Fix $p, q \in [1,\infty]$. Consider the robust $\ell_p$ regression problem
$$\min_\beta \max_{\Delta \in \mathcal{U}_{F_q}} \|y - (X+\Delta)\beta\|_p,$$
where $\mathcal{U}_{F_q} = \{\Delta \in \mathbb{R}^{m \times n} : \|\Delta\|_{F_q} \le \lambda\}$. In the case when $p = q$ we saw earlier (Theorem 2) that one exactly recovers $\ell_p$ regression with an $\ell_{p^*}$ penalty:
$$\min_\beta \max_{\Delta \in \mathcal{U}_{F_p}} \|y - (X+\Delta)\beta\|_p = \min_\beta \|y - X\beta\|_p + \lambda \|\beta\|_{p^*}.$$
Let us now consider the case when $p \ne q$. We claim that regularization (with $h$) is no longer equivalent to robustification (with $\mathcal{U}_{F_q}$) unless $p \in \{1,\infty\}$. Applying Proposition 2, one has for any $z \in \mathbb{R}^m$ that
$$\max_{\Delta \in \mathcal{U}_{F_q}} \|z + \Delta\beta\|_p \le \|z\|_p + h(\beta),$$
where $h(\beta) = \max_{\Delta \in \mathcal{U}_{F_q}} \|\Delta\beta\|_p$ is a norm (when $p = q$, this is precisely the $\ell_{p^*}$ norm, multiplied by $\lambda$). Here we can compute $h$. To do this we first define a discrepancy function as follows:

Definition 1. For $a, b \in [1,\infty]$ define the discrepancy function $\delta_m(a,b)$ as
$$\delta_m(a,b) := \max\{\|u\|_a : u \in \mathbb{R}^m,\ \|u\|_b \le 1\}.$$

This discrepancy function is computable and well-known (see e.g. Horn & Johnson, 2013):
$$\delta_m(a,b) = \begin{cases} m^{1/a - 1/b}, & \text{if } a \le b, \\ 1, & \text{if } a \ge b. \end{cases}$$
It satisfies $1 \le \delta_m(a,b) \le m$ and $\delta_m(a,b)$ is continuous in $a$ and $b$. One has that $\delta_m(a,b) = \delta_m(b,a) = 1$ if and only if $a = b$ (so long as $m \ge 2$). Using this, we now proceed with the theorem. The proof applies basic tools from real analysis and is contained in Appendix A.

Theorem 4.
(a) For any $z \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$,
$$\max_{\Delta \in \mathcal{U}_{F_q}} \|z + \Delta\beta\|_p \le \|z\|_p + \lambda\, \delta_m(p,q)\, \|\beta\|_{q^*}. \qquad (3)$$
(b) When $p \in \{1,\infty\}$, there is equality in (3) for all $(z, \beta)$.
(c) When $p \in (1,\infty)$ and $p \ne q$, for any $\beta \ne 0$ the set of $z \in \mathbb{R}^m$ for which the inequality (3) holds at equality is a finite union of one-dimensional subspaces (so long as $m \ge 2$). Hence, for any $\beta \ne 0$ the inequality in (3) is strict for almost all $z$.
(d) For $p \in (1,\infty)$, one has for all $z \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$ that
$$\|z\|_p + \frac{\lambda}{\delta_m(q,p)} \|\beta\|_{q^*} \le \max_{\Delta \in \mathcal{U}_{F_q}} \|z + \Delta\beta\|_p. \qquad (4)$$
(e) For $p \in (1,\infty)$, the lower bound in (4) is best possible in the sense that the gap can be arbitrarily small, i.e., for any $\beta \in \mathbb{R}^n$,
$$\inf_{z} \Big[ \max_{\Delta \in \mathcal{U}_{F_q}} \|z + \Delta\beta\|_p - \|z\|_p - \frac{\lambda}{\delta_m(q,p)} \|\beta\|_{q^*} \Big] = 0.$$
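A numerical sketch (ours) of Theorem 4 for the case $p = 2$, $q = \infty$, anticipating Example 1 below. For $\|\Delta\|_{F_\infty} \le \lambda$ the vector $\Delta\beta$ ranges exactly over the box $[-\lambda\|\beta\|_1, \lambda\|\beta\|_1]^m$, so the inner maximum has the closed form $\|\,|z| + \lambda\|\beta\|_1 \mathbf{1}\,\|_2$; this lets us compare it against the bounds (3) and (4), using $\delta_m(2,\infty) = \sqrt{m}$ and $\delta_m(\infty,2) = 1$. The helper `delta_m` and the sampled data are our own.

```python
# Numerical sketch (ours) of Theorem 4 for p = 2, q = infinity.
import numpy as np

def delta_m(a, b, m):
    """Closed form of the discrepancy function: m**(1/a - 1/b) if a <= b, else 1."""
    inv = lambda t: 0.0 if np.isinf(t) else 1.0 / t
    return m ** (inv(a) - inv(b)) if a <= b else 1.0

rng = np.random.default_rng(0)
m, n, lam = 6, 4, 0.3
z = rng.standard_normal(m)
beta = rng.standard_normal(n)

c = lam * np.linalg.norm(beta, 1)
worst = np.linalg.norm(np.abs(z) + c)                  # exact value of the inner max

lower = np.linalg.norm(z) + lam / delta_m(np.inf, 2, m) * np.linalg.norm(beta, 1)  # bound (4)
upper = np.linalg.norm(z) + lam * delta_m(2, np.inf, m) * np.linalg.norm(beta, 1)  # bound (3)
print(lower <= worst <= upper)                         # True
print(lower < worst < upper)                           # strict for generic z, cf. (c) and (e)
```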

Theorem 4 characterizes precisely when robustification under $\mathcal{U}_{F_q}$ is equivalent to regularization for the case of $\ell_p$ regression. In particular, when $p \ne q$ and $p \in (1,\infty)$, the two are not equivalent, and one only has that
$$\min_\beta \|y - X\beta\|_p + \frac{\lambda}{\delta_m(q,p)} \|\beta\|_{q^*} \ \le\ \min_\beta \max_{\Delta \in \mathcal{U}_{F_q}} \|y - (X+\Delta)\beta\|_p \ \le\ \min_\beta \|y - X\beta\|_p + \lambda\, \delta_m(p,q)\, \|\beta\|_{q^*}.$$
Further, we have shown that these upper and lower bounds are the best possible (Theorem 4, parts (c) and (e)). While $\ell_p$ regression with uncertainty set $\mathcal{U}_{F_q}$ for $p \ne q$ and $p \in (1,\infty)$ still has both upper and lower bounds which correspond to regularization (with different regularization parameters $\tilde\lambda \in [\lambda/\delta_m(q,p),\ \lambda\,\delta_m(p,q)]$), we emphasize that in this case there is no longer the direct connection between the parameter governing the magnitude of uncertainty ($\lambda$) and the parameter for regularization ($\tilde\lambda$).

Example 1. As a concrete example, consider the implications of Theorem 4 when $p = 2$ and $q = \infty$. We have that
$$\min_\beta \|y - X\beta\|_2 + \lambda \|\beta\|_1 \ \le\ \min_\beta \max_{\Delta \in \mathcal{U}_{F_\infty}} \|y - (X+\Delta)\beta\|_2 \ \le\ \min_\beta \|y - X\beta\|_2 + \lambda \sqrt{m}\, \|\beta\|_1.$$

When the equality $\max_{\Delta \in \mathcal{U}} g(z + \Delta\beta) = g(z) + \lambda h(\beta)$ fails in this way, the robustified and regularized problems can behave quite differently.

Theorem 5. In a setting when robustification and regularization are not equivalent, it is possible for the two problems to have different optimal solutions. In particular,
$$\hat\beta \in \arg\min_\beta \max_{\Delta \in \mathcal{U}} g(y - (X+\Delta)\beta)$$
is not necessarily a solution of
$$\min_\beta g(y - X\beta) + \tilde\lambda\, h(\beta)$$
for any $\tilde\lambda \ge 0$, and vice versa.

As a result, when robustification and regularization do not coincide, they can induce structurally distinct solutions. In other words, the regularization path (as $\tilde\lambda \in (0,\infty)$ varies) and the robustification path (as the radius $\lambda \in (0,\infty)$ of $\mathcal{U}$ varies) can be different.

We now proceed to analyze another setting in which robustification is not equivalent to regularization. The setting, in line with Theorem 2, is $\ell_p$ regression under spectral uncertainty sets $\mathcal{U}_{\sigma_q}$. As per Theorem 2, one has that
$$\min_\beta \max_{\Delta \in \mathcal{U}_{\sigma_q}} \|y - (X+\Delta)\beta\|_2 = \min_\beta \|y - X\beta\|_2 + \lambda \|\beta\|_2.$$
For general $p$, precisely as in Theorem 4 one only has the bounds
$$\|z\|_p + \frac{\lambda}{\delta_m(2,p)} \|\beta\|_2 \ \le\ \max_{\Delta \in \mathcal{U}_{\sigma_q}} \|z + \Delta\beta\|_p \ \le\ \|z\|_p + \lambda\, \delta_m(p,2)\, \|\beta\|_2. \qquad (5)$$

Proof. This result is Theorem 4 in disguise. This follows by noting that
$$\max_{\Delta \in \mathcal{U}_{\sigma_q}} \|z + \Delta\beta\|_p = \max_{\Delta \in \mathcal{U}_{F_2}} \|z + \Delta\beta\|_p$$
and directly applying the preceding results.

We now consider a third setting for $\ell_p$ regression, this time subject to uncertainty $\mathcal{U}_{(q,r)}$; this is a generalized version of the problems considered in Theorems 1 and 3. From Theorem 1 we know that if $p = r$, then
$$\min_\beta \max_{\Delta \in \mathcal{U}_{(q,p)}} \|y - (X+\Delta)\beta\|_p = \min_\beta \|y - X\beta\|_p + \lambda \|\beta\|_q.$$
Similarly, as per Theorem 3, when $r = \infty$ and $p \in \{1,\infty\}$,
$$\min_\beta \max_{\Delta \in \mathcal{U}_{(q,\infty)}} \|y - (X+\Delta)\beta\|_p = \min_\beta \|y - X\beta\|_p + \lambda\, \delta_m(p,\infty)\, \|\beta\|_q.$$
Given these results, it is natural to inquire what happens for more general choices of induced uncertainty set $\mathcal{U}_{(q,r)}$. As before with Theorem 4, we have a complete characterization of the ...
