An Introduction To Optimal Transport

Transcription

Weierstrass Institute for Applied Analysis and Stochastics
An Introduction to Optimal Transport
Pavel Dvurechensky
Partially based on joint work with Darina Dvinskikh, Alexander Gasnikov, Alexey Kroshnin, Mathias Liero, Nazarii Tupitsa, Cesar Uribe
HSE-Yandex autumn school on generative models, Moscow, 26-29 November 2019
Mohrenstrasse 39 · 10117 Berlin · Germany · Tel. +49 30 20372 0 · www.wias-berlin.de

Content
1 Introduction
2 Application examples
3 Numerical methods for OT distance
4 OT barycenters
Optimal Transport · 26-27.11.2019 · Page 2 (103)

1 Introduction

Monge Problem
- $C(x,y): \mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}$ – transportation cost function;
- $f(x), g(y) \ge 0$ s.t. $\int_{\mathbb{R}^d} f(x)\,dx = \int_{\mathbb{R}^d} g(y)\,dy = 1$ – mass distributions;
- Goal: a transport map $T: \mathbb{R}^d\to\mathbb{R}^d$ s.t. for all $A\subseteq\mathbb{R}^d$, $\int_A g(y)\,dy = \int_{T^{-1}(A)} f(x)\,dx$, minimizing $\int_{\mathbb{R}^d} C(x, T(x))\,f(x)\,dx$.
G. Monge, Mémoire sur la théorie des déblais et des remblais, 1781.

Monge Problem
- $(E, D)$ – metric space;
- $C(x,y): E\times E\to\mathbb{R}$ – transportation cost function;
- $\mu, \nu \in P_2(E)$ – measures to be transported;
- transport map $T: E\to E$ s.t. for all $B$, $\mu(T^{-1}(B)) = \nu(B)$, i.e. $\nu = T_\#\mu$.
$\inf_T \int_E C(x, T(x))\,\mu(dx)$.
G. Monge, Mémoire sur la théorie des déblais et des remblais, 1781.

References
- Filippo Santambrogio. Optimal Transport for Applied Mathematicians.
- Cédric Villani. Topics in Optimal Transportation; Optimal Transport: Old and New.
- Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures.
- Gabriel Peyré and Marco Cuturi. Computational Optimal Transport.

Drawbacks of the Monge formulation
- Highly non-linear constraint $\int_A g(y)\,dy = \int_{T^{-1}(A)} f(x)\,dx$. A change of variables leads to the nonlinear PDE $g(T(x))\det(DT(x)) = f(x)$. If $T(x) = \nabla u(x)$, we obtain a Monge-Ampère equation $\det D^2 u(x) = \frac{f(x)}{g(\nabla u(x))}$. NB: one of the ways to find the OT map is to solve this equation.
- The mass cannot be split, and the solution does not exist in some situations.
Image courtesy: Mathias Liero.

Kantorovich's relaxation
Instead of a transport map $T: E\to E$, consider transport plans $\pi\in P(E\times E)$; $\pi(x,y)$ – amount of mass transported from $x$ to $y$.
The constraints become linear:
$U(\mu,\nu) = \left\{\pi\in P(E\times E) : \int_E \pi(x,y)\,dy = \mu(x),\ \int_E \pi(x,y)\,dx = \nu(y)\right\}$,
$\inf_{\pi\in U(\mu,\nu)} \int_{E\times E} C(x,y)\,d\pi(x,y)$.
- Linear objective and linear constraints, but larger dimension.
- The feasible set is compact in the weak topology and a solution exists.
- Probabilistic interpretation: $\min_{\pi:(X,Y)\sim\pi} \mathbb{E}_\pi[C(X,Y)]$ s.t. $X\sim\mu$, $Y\sim\nu$.
L. Kantorovich, On the transfer of masses, 1942.

Kantorovich's relaxation
Image: Gabriel Peyré and Marco Cuturi. Computational Optimal Transport.
- Solves the mass-splitting problem of the Monge formulation.
- Monge transport maps are covered by transport plans of the form $\pi_T = (\mathrm{Id}, T)_\#\mu$, i.e. $\pi_T(A\times B) = \mu(\{x\in A: T(x)\in B\})$.

Non-uniqueness of the transport plan
Image courtesy: Mathias Liero.

Monge-Kantorovich (Wasserstein) distance
$E = \mathbb{R}^d$ and the cost function $C(x,y) = \|x-y\|^p$, $p\in[1,\infty)$, lead to the Monge-Kantorovich (Wasserstein) distances
$(W_p(\mu,\nu))^p = \inf_{\pi\in U(\mu,\nu)} \int_{E\times E} \|x-y\|^p\,d\pi(x,y)$.
- $W_p$ is a distance on the space $P_p(\mathbb{R}^d)$ of probability measures with finite $p$-th moments.
- Its topology is equivalent to weak convergence together with convergence of $p$-th moments.
- Has a Riemannian structure (Benamou-Brenier formulation, Otto calculus). Geodesic curves are the shortest curves connecting two probability measures.
- For $p = 1$ it is known as the Earth mover's distance.
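For one-dimensional empirical measures the $p = 1$ case can be checked numerically: SciPy ships an Earth-mover's-distance routine for samples. A minimal sketch (not part of the slides; the Gaussians and the shift value are my choices):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)   # samples from mu
y = rng.normal(2.0, 1.0, size=10_000)   # samples from nu

# In 1D, W_1 reduces to the L1 distance between quantile functions,
# which SciPy computes from the sorted samples. Two Gaussians with the
# same variance differ only by the mean shift, so W_1 is close to 2.
print(wasserstein_distance(x, y))
```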

Monge-Kantorovich (Wasserstein) distance
$(W_p(\mu,\nu))^p = \inf_{\pi\in U(\mu,\nu)} \int_{E\times E} \|x-y\|^p\,d\pi(x,y)$.
- A natural distance between probability measures and a tool for comparing them.
- Gives rise to a topological space with a Riemannian structure.
- A "horizontal" distance, as opposed to $L_p$ distances.
- In ML we need to find a probability measure which approximates the data distribution; we can use this distance as a regularizer.
Image courtesy: Mathias Liero.

Kantorovich duality
$\min_\pi\left\{\int_{E\times E} C(x,y)\,d\pi(x,y) : \pi^1_\# = \mu,\ \pi^2_\# = \nu\right\}$.
Lagrange multipliers $\xi: E\to\mathbb{R}$, $\eta: E\to\mathbb{R}$ (Kantorovich potentials):
$\min_{\pi\in\mathcal{M}_+(E\times E)}\ \sup_{\xi,\eta}\ \int_{E\times E} C(x,y)\,d\pi(x,y) + \int_E \xi(x)\,d(\mu-\pi^1_\#) + \int_E \eta(y)\,d(\nu-\pi^2_\#)$
$= \sup_{\xi,\eta}\ \int_E \xi(x)\,d\mu + \int_E \eta(y)\,d\nu + \inf_\pi \int_{E\times E} (C(x,y)-\xi(x)-\eta(y))\,d\pi(x,y)$
$= \sup_{\xi,\eta}\left\{\int_E \xi(x)\,d\mu + \int_E \eta(y)\,d\nu : C(x,y)-\xi(x)-\eta(y) \ge 0\right\}$.
Economic interpretation: outsourcing transport to a vendor.
- $\xi(x)$ – vendor price for transportation of a unit mass from $x$;
- $\eta(y)$ – vendor price for transportation of a unit mass to $y$;
- vendor prices are reasonable if $\xi(x)+\eta(y) \le C(x,y)$;
- the vendor maximizes the profit.

Connection between the Monge and Kantorovich problems
Theorem [Brenier, 1991]. Assume:
- $E = \mathbb{R}^d$, $C(x,y) = \|x-y\|^2$;
- one of the measures (say, $\mu$) has a density w.r.t. the Lebesgue measure.
Then:
- The optimal transport plan $\pi^*$ in the Kantorovich formulation is unique and is supported on the graph $(x, T(x))$ of a transport map in the Monge formulation: $\pi^* = (\mathrm{Id}, T)_\#\mu$, i.e. for all $h\in C(E\times E)$, $\int_{E\times E} h(x,y)\,d\pi^*(x,y) = \int_E h(x, T(x))\,d\mu(x)$.
- $T$ is uniquely defined as the gradient of a convex function $\varphi$, $T(x) = \nabla\varphi(x)$, where $\varphi$ is the unique (up to an additive constant) convex function s.t. $(\nabla\varphi)_\#\mu = \nu$.
- $\varphi(x) = \frac{\|x\|_2^2}{2} - \xi^*(x)$, where $\xi^*$ is the optimal dual solution.

Distance between Gaussians
Assume $\mu = \mathcal{N}(m_\mu,\Sigma_\mu)$, $\nu = \mathcal{N}(m_\nu,\Sigma_\nu)$ in $\mathbb{R}^d$. Let
$A = \Sigma_\mu^{-1/2}\left(\Sigma_\mu^{1/2}\Sigma_\nu\Sigma_\mu^{1/2}\right)^{1/2}\Sigma_\mu^{-1/2}$. NB: $\Sigma_\mu = \Sigma_\nu \Rightarrow A = I$.
$T: x\mapsto m_\nu + A(x-m_\mu)$ is s.t. $T_\#\rho_\mu = \rho_\nu$, and
$T(x) = \nabla\left(\frac{1}{2}\langle x-m_\mu, A(x-m_\mu)\rangle + \langle m_\nu, x\rangle\right)$.
$W_2^2(\mu,\nu) = \|m_\mu-m_\nu\|_2^2 + \mathrm{tr}\left(\Sigma_\mu + \Sigma_\nu - 2\left(\Sigma_\mu^{1/2}\Sigma_\nu\Sigma_\mu^{1/2}\right)^{1/2}\right)$.
Image: Gabriel Peyré and Marco Cuturi. Computational Optimal Transport.
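The closed-form expression above is easy to evaluate numerically. A minimal sketch (not from the slides; `w2_gaussian` is my own name), using `scipy.linalg.sqrtm` for the matrix square roots:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m0, S0, m1, S1):
    """W_2 between N(m0, S0) and N(m1, S1) via the closed form above."""
    S0h = sqrtm(S0)
    cross = sqrtm(S0h @ S1 @ S0h)   # (S0^{1/2} S1 S0^{1/2})^{1/2}
    # sqrtm may return a complex array with tiny imaginary part; drop it.
    return np.sqrt(np.sum((m0 - m1) ** 2) + np.trace(S0 + S1 - 2 * cross)).real

m0, m1 = np.zeros(2), np.array([3.0, 0.0])
S = np.eye(2)
# Equal covariances: the trace term vanishes, distance = ||m0 - m1|| = 3.
print(w2_gaussian(m0, S, m1, S))
```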

Continuous formulation
$(W_2(\mu_0,\mu_1))^2 = \min_\pi\left\{\int_{\mathbb{R}^d\times\mathbb{R}^d} \|x-y\|^2\,d\pi(x,y) : \pi^1_\# = \mu_0,\ \pi^2_\# = \mu_1\right\}$.
Benamou-Brenier "dynamical" equivalent formulation: consider connecting curves $t\mapsto\mu_t$,
$(W_2(\mu_0,\mu_1))^2 = \min\left\{\int_0^1\int_{\mathbb{R}^d} \|v_t(x)\|^2\,d\mu_t\,dt : \mu_{t=0}=\mu_0,\ \mu_{t=1}=\mu_1,\ \frac{d\mu_t}{dt} + \mathrm{div}(v_t\mu_t) = 0\right\}$.
Image courtesy: Mathias Liero.

Discrete case
- $x_i\in\mathbb{R}^d$, $i=1,\dots,n$ – support of $\mu$; $y_j\in\mathbb{R}^d$, $j=1,\dots,n$ – support of $\nu$;
- $a\in S_n(1)$, $\mu = \sum_{i=1}^n a_i\delta(x_i)$; $b\in S_n(1)$, $\nu = \sum_{j=1}^n b_j\delta(y_j)$;
- $C_{ij} = C(x_i, y_j)$, $i,j=1,\dots,n$ – ground cost matrix;
- $X_{ij} = \pi(x_i, y_j)$, $i,j=1,\dots,n$ – transportation plan.
Optimal transport problem:
$\min_{X\in U(a,b)} \langle C, X\rangle$, $U(a,b) := \{X\in\mathbb{R}^{n\times n}_+ : X\mathbf{1}=a,\ X^T\mathbf{1}=b\}$.
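For small $n$ the discrete problem can be solved directly as the linear program above. A minimal sketch (not the lecture's code; `ot_lp` is my own name), using `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(C, a, b):
    """Solve min <C, X> over U(a, b) as a linear program (small n only)."""
    n, m = C.shape
    # Equality constraints X 1 = a and X^T 1 = b, with X flattened row-major.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0          # row sums
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                   # column sums
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun

C = np.array([[0.0, 1.0], [1.0, 0.0]])
a = np.array([0.5, 0.5]); b = np.array([0.3, 0.7])
X, cost = ot_lp(C, a, b)
print(cost)  # 0.2: move 0.2 of mass from x_1 to y_2 at unit cost
```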

Kantorovich duality
$\min\left\{\langle C,X\rangle : X\in\mathbb{R}^{n\times n}_+,\ X\mathbf{1}=a,\ X^T\mathbf{1}=b\right\}$.
Lagrange multipliers $\xi, \eta\in\mathbb{R}^n$ (Kantorovich potentials):
$\min_{X\in\mathbb{R}^{n\times n}_+}\left\{\langle C,X\rangle + \max_{\xi,\eta}\left\{\langle\xi, a - X\mathbf{1}\rangle + \langle\eta, b - X^T\mathbf{1}\rangle\right\}\right\}$
$= \max_{\xi,\eta}\left\{\langle\xi,a\rangle + \langle\eta,b\rangle + \min_{X\in\mathbb{R}^{n\times n}_+}\langle C - \xi\mathbf{1}^T - \mathbf{1}\eta^T, X\rangle\right\}$
$= \max_{\xi,\eta}\left\{\langle\xi,a\rangle + \langle\eta,b\rangle : C_{ij} - \xi_i - \eta_j \ge 0\right\}$
$= \max_{\xi}\left\{\langle\xi,a\rangle + \sum_{j=1}^n b_j \min_{i=1,\dots,n}\{C_{ij} - \xi_i\}\right\}$.  (1)

Semi-discrete case
- $y_j\in\mathbb{R}^d$, $j=1,\dots,n$ – support of $\nu$;
- $b\in S_n(1)$, $\nu = \sum_{j=1}^n b_j\delta(y_j)$;
- $C_j(x)$, $j=1,\dots,n$ – cost functions;
- $\pi_j(x)$, $j=1,\dots,n$ – transportation plan.
$\min_\pi\left\{\sum_{j=1}^n\int_{\mathbb{R}^d} C_j(x)\,d\pi_j(x) : \sum_{j=1}^n\pi_j = \mu,\ \int_{\mathbb{R}^d} d\pi_j(x) = b_j,\ j=1,\dots,n\right\}$.

Kantorovich duality
$\min_\pi\left\{\sum_{j=1}^n\int_{\mathbb{R}^d} C_j(x)\,d\pi_j(x) : \sum_{j=1}^n\pi_j = \mu,\ \int_{\mathbb{R}^d} d\pi_j(x) = b_j,\ j=1,\dots,n\right\}$.
Lagrange multipliers $\xi:\mathbb{R}^d\to\mathbb{R}$, $\eta\in\mathbb{R}^n$ (Kantorovich potentials):
$\min_\pi\ \sup_{\xi,\eta}\left\{\sum_{j=1}^n\int_{\mathbb{R}^d}(C_j(x)-\xi(x)-\eta_j)\,d\pi_j(x) + \int_{\mathbb{R}^d}\xi(x)\,d\mu + \sum_{j=1}^n\eta_j b_j\right\}$
$= \sup_{\eta\in\mathbb{R}^n}\left\{\int_{\mathbb{R}^d}\min_j\{C_j(x)-\eta_j\}\,d\mu + \sum_{j=1}^n\eta_j b_j\right\}$
$= \sup_{\eta\in\mathbb{R}^n}\left\{\mathbb{E}_{X\sim\mu}\min_j\{C_j(X)-\eta_j\} + \sum_{j=1}^n\eta_j b_j\right\}$.

2 Application examples

Building blocks
- $E$ – basis space;
- $C(x,y)$ – transportation cost;
- $P(E)$ – space of measures on $E$.

Image retrieval
Goal: given an image, find similar ones in the database.
- Basis space – CIELAB color space.
- Cost – Euclidean distance (short Euclidean distances correlate strongly with human color discrimination performance).
- Measures – histograms given by bins.
Rubner, Tomasi, Guibas. The earth mover's distance as a metric for image retrieval. 2000.
https://en.wikipedia.org/wiki/Color_histogram

Image classification
Goal: classify images from the MNIST dataset.
- Basis space – pixel grid.
- Cost – squared Euclidean distance.
- Measures – histograms of pixel intensities.
Run a standard SVM based on the distance between images.
Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NIPS 2013.

Image color transfer
Goal: transfer color from one image to another.
Blondel, Seguy, Rolet. Smooth and Sparse Optimal Transport. AISTATS 2018.

Image color transfer
- Basis space – RGB color space.
- Cost – squared Euclidean distance.
- Measures – histograms given by clustering.
Blondel, Seguy, Rolet. Smooth and Sparse Optimal Transport. AISTATS 2018.

Barycenter
$\hat x = \frac{1}{m}\sum_{i=1}^m x_i = \arg\min_x \frac{1}{m}\sum_{i=1}^m \|x - x_i\|_2^2$.

Template image reconstruction
Goal: reconstruct a template from its random transformations.
- Basis space – pixel grid.
- Cost – squared Euclidean distance.
- Measures – histograms given by intensity of pixels.
$\hat I = \arg\min_I \frac{1}{m}\sum_{i=1}^m \mathrm{Dist}(I, I_i)$.
The figure compares the Euclidean distance with the OT distance as the choice of Dist.
Cuturi, Doucet. Fast Computation of Wasserstein Barycenters. ICML 2014.

Shape interpolation
$\mu_t = \arg\min_\mu\left\{(1-t)\,\widetilde W_2^2(\mu_0,\mu) + t\,\widetilde W_2^2(\mu,\mu_1)\right\}$.
Solomon, de Goes, Peyré, Cuturi, Butscher, Nguyen, Du, Guibas. Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Trans. Graph. 2015.

3 Numerical methods for OT distance

Sinkhorn's algorithm

Possible approaches
Find $\hat X\in U(a,b)$ s.t. $\langle C,\hat X\rangle - \min_{X\in U(a,b)}\langle C,X\rangle \le \varepsilon$, where $U(a,b) := \{X\in\mathbb{R}^{n\times n}_+ : X\mathbf{1}=a,\ X^T\mathbf{1}=b\}$.
- Linear programming problem with complexity $O(n^3\ln n)$ arithmetic operations [Pele & Werman, 2009].
- Widespread approach [Cuturi, 2013]: solve by Sinkhorn's algorithm the entropy-regularized optimal transport problem $\min_{X\in U(a,b)}\langle C,X\rangle + \gamma\langle X,\ln X\rangle$.
- Complexity by accelerated gradient descent and by Sinkhorn's algorithm [D., Gasnikov, Kroshnin, 2018]: $\widetilde O\left(\frac{n^{2.5}}{\varepsilon}\right)$ and $\widetilde O\left(\frac{n^2}{\varepsilon^2}\right)$, respectively.

Entropy-regularized OT
Primal problem:
$\min_{X\in U(a,b)}\langle C,X\rangle + \gamma\langle X,\ln X\rangle = \min_{X\in U(a,b)}\gamma\left(-\langle X,\ln e^{-C/\gamma}\rangle + \langle X,\ln X\rangle\right) = \min_{X\in U(a,b)}\gamma\,KL\left(X, e^{-C/\gamma}\right)$,
$U(a,b) = \{X\in\mathbb{R}^{n\times n}_+ : X\mathbf{1}=a,\ X^T\mathbf{1}=b\}$.
Dual problem:
$\max_{\xi,\eta}\left\{-\gamma\sum_{i,j=1}^n\exp\left(\frac{-C_{ij}+\xi_i+\eta_j}{\gamma}\right) + \langle\xi,a\rangle + \langle\eta,b\rangle\right\}$.
Cf. the dual of the non-regularized problem:
$\max_{\xi,\eta}\{\langle\xi,a\rangle + \langle\eta,b\rangle : C_{ij}-\xi_i-\eta_j \ge 0\}$.
NB: regularization introduces an error $\gamma\langle X,\ln X\rangle\in[-\gamma\ln(n^2), 0]$, so we need to take $\gamma = \Theta(\varepsilon/\ln n)$.

Sinkhorn's algorithm: primal view
Primal problem:
$\min_{X\in\mathbb{R}^{n\times n}_+} KL\left(X, e^{-C/\gamma}\right)$ s.t. $X\mathbf{1}=a$, $X^T\mathbf{1}=b$.
Alternating minimization/projection: $X^{(0)} = e^{-C/\gamma}$,
$X^{(k+1)} = \arg\min_{X: X\mathbf{1}=a} KL\left(X, X^{(k)}\right)$, $X^{(k+2)} = \arg\min_{X: X^T\mathbf{1}=b} KL\left(X, X^{(k+1)}\right)$.

Sinkhorn's algorithm: dual view
Dual problem:
$\max_{\xi,\eta} -\gamma\sum_{i,j=1}^n\exp\left(\frac{-C_{ij}+\xi_i+\eta_j}{\gamma}\right) + \langle\xi,a\rangle + \langle\eta,b\rangle = \max_{\xi,\eta} -\gamma\left(e^{\xi/\gamma}\right)^T e^{-C/\gamma}\, e^{\eta/\gamma} + \langle\xi,a\rangle + \langle\eta,b\rangle$.
Optimality conditions (gradient equal to 0):
$\mathrm{diag}\left(e^{\xi/\gamma}\right) e^{-C/\gamma}\, e^{\eta/\gamma} = a$, $\mathrm{diag}\left(e^{\eta/\gamma}\right) e^{-C^T/\gamma}\, e^{\xi/\gamma} = b$.
Alternating minimization in $\xi$, $\eta$ (elementwise logarithm and division):
$\xi^{(k+1)} = \gamma\ln\left(a \oslash e^{-C/\gamma} e^{\eta^{(k)}/\gamma}\right)$, $\eta^{(k+1)} = \gamma\ln\left(b \oslash e^{-C^T/\gamma} e^{\xi^{(k+1)}/\gamma}\right)$.

Sinkhorn's algorithm: equivalent formulations
$\max_{\xi,\eta} -\gamma\left(e^{\xi/\gamma}\right)^T e^{-C/\gamma}\, e^{\eta/\gamma} + \langle\xi,a\rangle + \langle\eta,b\rangle$
$= -\gamma\min_{u,v}\left\{\left(e^u\right)^T K e^v - \langle u,a\rangle - \langle v,b\rangle\right\}$ (with $u = \xi/\gamma$, $v = \eta/\gamma$)
$= -\gamma\min_{u,v}\left\{\psi(u,v) := \mathbf{1}^T B(u,v)\mathbf{1} - \langle u,a\rangle - \langle v,b\rangle\right\}$,
where $K := e^{-C/\gamma}$ and $B(u,v) := \mathrm{diag}(e^u)\,K\,\mathrm{diag}(e^v)$.
Equivalent updates:
$\xi^{(k+1)} = \gamma\ln\left(a \oslash K e^{\eta^{(k)}/\gamma}\right)$, $\eta^{(k+1)} = \gamma\ln\left(b \oslash K^T e^{\xi^{(k+1)}/\gamma}\right)$;
$u^{(k+1)} = \ln\left(a \oslash K e^{v^{(k)}}\right)$, $v^{(k+1)} = \ln\left(b \oslash K^T e^{u^{(k+1)}}\right)$;
$\tilde u^{(k+1)} = a \oslash K\tilde v^{(k)}$, $\tilde v^{(k+1)} = b \oslash K^T\tilde u^{(k+1)}$.

Sinkhorn's algorithm (cont.)
Primal problem: $\min_{X\in U(a,b)}\langle C,X\rangle + \gamma\langle X,\ln X\rangle$.
Dual problem: $\min_{u,v\in\mathbb{R}^n}\left\{\psi(u,v) := \mathbf{1}^T B(u,v)\mathbf{1} - \langle u,a\rangle - \langle v,b\rangle\right\}$,
where $K := e^{-C/\gamma}$ and $B(u,v) := \mathrm{diag}(e^u)\,K\,\mathrm{diag}(e^v)$.
Sinkhorn's algorithm:
1: repeat
2:   if $k \bmod 2 = 0$ then
3:     $u_{k+1} = u_k + \ln\left(a \oslash B(u_k,v_k)\mathbf{1}\right)$, $v_{k+1} = v_k$
4:   else
5:     $v_{k+1} = v_k + \ln\left(b \oslash B(u_k,v_k)^T\mathbf{1}\right)$, $u_{k+1} = u_k$
6:   end if
7:   $k = k+1$
8: until $\|B(u_k,v_k)\mathbf{1} - a\|_1 + \|B(u_k,v_k)^T\mathbf{1} - b\|_1 \le \varepsilon'$
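The pseudocode above can be sketched in NumPy as follows (a minimal illustration with my own variable names, not the lecturer's implementation):

```python
import numpy as np

def sinkhorn(C, a, b, gamma, tol=1e-6, max_iter=10_000):
    """Alternating dual updates from the slide: K = exp(-C/gamma),
    B(u, v) = diag(e^u) K diag(e^v)."""
    K = np.exp(-C / gamma)
    u = np.zeros(len(a)); v = np.zeros(len(b))
    for k in range(max_iter):
        B = np.exp(u)[:, None] * K * np.exp(v)[None, :]
        if np.abs(B.sum(1) - a).sum() + np.abs(B.sum(0) - b).sum() <= tol:
            break
        if k % 2 == 0:
            u += np.log(a / B.sum(1))   # u-step: match row marginals
        else:
            v += np.log(b / B.sum(0))   # v-step: match column marginals
    return np.exp(u)[:, None] * K * np.exp(v)[None, :]

C = np.array([[0.0, 1.0], [1.0, 0.0]])
a = np.array([0.5, 0.5]); b = np.array([0.3, 0.7])
X = sinkhorn(C, a, b, gamma=0.01)
print(np.sum(C * X))  # close to the unregularized optimum 0.2
```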

Sinkhorn's algorithm convergence rate
Bounds for the iterates and the optimal solution [D., Gasnikov, Kroshnin, 2018]:
denote $R := -\ln\left(\nu\min_{i,j}\{a_i, b_j\}\right)$, $\nu := \min_{i,j} K_{ij} = e^{-\|C\|_\infty/\gamma}$. Then $\max_i u_i^k - \min_i u_i^k \le R$, and the same bounds hold for $v^k$, $u^*$, $v^*$.
Objective residual and constraint feasibility [D., Gasnikov, Kroshnin, 2018]:
denote $\tilde\psi(u,v) := \psi(u,v) - \psi(u^*,v^*)$. Then
$\tilde\psi(u_k,v_k) \le R\left(\|B(u_k,v_k)\mathbf{1} - a\|_1 + \|B(u_k,v_k)^T\mathbf{1} - b\|_1\right)$.
Sinkhorn's convergence rate [D., Gasnikov, Kroshnin, 2018]:
Sinkhorn's algorithm requires no more than $k \le 2 + \frac{4R}{\varepsilon'}$ iterations to find $B(u_k,v_k)$ s.t. $\|B(u_k,v_k)\mathbf{1} - a\|_1 + \|B(u_k,v_k)^T\mathbf{1} - b\|_1 \le \varepsilon'$.

Proof sketch
For an even step (the $u$-update),
$\psi(u_k,v_k) - \psi(u_{k+1},v_{k+1}) = \langle a, u_{k+1}-u_k\rangle = \langle a, \ln a - \ln(B_k\mathbf{1})\rangle = KL(a\,\|\,B_k\mathbf{1}) \ge \frac{1}{2}\|B_k\mathbf{1} - a\|_1^2$
by Pinsker's inequality, and the odd steps are symmetric. Combining this with $\tilde\psi(u_k,v_k) \le R\left(\|B_k\mathbf{1}-a\|_1 + \|B_k^T\mathbf{1}-b\|_1\right)$ gives, while the stopping criterion is not satisfied,
$\tilde\psi(u_k,v_k) - \tilde\psi(u_{k+1},v_{k+1}) \ge \max\left\{\frac{\tilde\psi(u_k,v_k)^2}{2R^2},\ \frac{(\varepsilon')^2}{2}\right\}$.
The first bound yields $\tilde\psi(u_k,v_k) \le \frac{2R^2}{k-1}$; the second terminates the process after at most $\frac{2\tilde\psi(u_k,v_k)}{(\varepsilon')^2}$ further iterations. Balancing the two phases at $\tilde\psi = R\varepsilon'$ gives $k \le 2 + \frac{4R}{\varepsilon'}$.

Approximate OT by Sinkhorn
Require: accuracy $\varepsilon$.
1: Set $\gamma = \frac{\varepsilon}{4\ln n}$, $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$.
2: Define $(\tilde r, \tilde c) = \left(1 - \frac{\varepsilon'}{8}\right)(a,b) + \frac{\varepsilon'}{8n}(\mathbf{1},\mathbf{1})$. NB: $\min_{i,j}\{\tilde r_i, \tilde c_j\} \ge \frac{\varepsilon'}{8n}$.
3: Calculate $B(u_k,v_k)$ by Sinkhorn's algorithm with marginals $\tilde r$, $\tilde c$ and accuracy $\varepsilon'/2$.
4: Find $\hat X$ as the projection of $B(u_k,v_k)$ on $U(a,b)$ by Algorithm 2 in [Altschuler et al., 2017].
Complexity of OT by Sinkhorn [D., Gasnikov, Kroshnin, 2018]:
the algorithm outputs $\hat X\in U(a,b)$ s.t. $\langle C,\hat X\rangle - \min_{X\in U(a,b)}\langle C,X\rangle \le \varepsilon$ in $O\left(\frac{n^2\|C\|_\infty^2\ln n}{\varepsilon^2}\right)$ arithmetic operations.

Projection on the transport polytope [Altschuler et al., 2017]
Require: matrix $Y$ to be projected, $U(a,b)$.
1: Set $X = \mathrm{diag}(x)$, where $x_i = \min\{a_i/[Y\mathbf{1}]_i,\ 1\}$.
2: Set $Y' = XY$.
3: Set $X = \mathrm{diag}(y)$, where $y_j = \min\{b_j/[(Y')^T\mathbf{1}]_j,\ 1\}$.
4: Set $Y'' = Y'X$.
5: Set $\mathrm{err}_a = a - Y''\mathbf{1}$, $\mathrm{err}_b = b - (Y'')^T\mathbf{1}$.
6: Output $\hat X = Y'' + \mathrm{err}_a\,\mathrm{err}_b^T / \|\mathrm{err}_a\|_1$.
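A NumPy sketch of this rounding procedure (my own code following the steps above; the guard against an already-feasible $Y$ is my addition):

```python
import numpy as np

def round_to_polytope(Y, a, b):
    """Round a positive matrix Y onto U(a, b), as in Algorithm 2 of
    Altschuler et al. (2017)."""
    x = np.minimum(a / Y.sum(1), 1.0)       # scale rows down to <= a
    Yp = x[:, None] * Y
    y = np.minimum(b / Yp.sum(0), 1.0)      # scale columns down to <= b
    Ypp = Yp * y[None, :]
    err_a = a - Ypp.sum(1)                  # nonnegative after the scalings
    err_b = b - Ypp.sum(0)
    if err_a.sum() > 0:                     # redistribute the missing mass
        Ypp = Ypp + np.outer(err_a, err_b) / np.abs(err_a).sum()
    return Ypp

Y = np.array([[0.3, 0.3], [0.3, 0.3]])
a = np.array([0.5, 0.5]); b = np.array([0.3, 0.7])
X = round_to_polytope(Y, a, b)
print(X.sum(1), X.sum(0))  # exactly a and b
```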

Accelerated gradient method

Drawbacks of entropy regularization
- Blurring in the transportation plan.
- Dense transportation plan.
Lower image: Blondel et al., 2017.

Our goals
- A better complexity bound than $O(n^3\ln n)$ (LP solver) and $O(n^2\ln n/\varepsilon^2)$ (Sinkhorn's algorithm).
- Flexibility w.r.t. the choice of the regularizer, e.g. squared Euclidean norm instead of the entropy.

Regularized optimal transport as a particular case
$\min_{x\in Q\subseteq E}\{f(x) : Ax = c\}$,
where
- $E$ – finite-dimensional real vector space;
- $Q$ – simple closed convex set;
- $A: E\to H$, $c\in H$;
- $f(x)$ is $\gamma$-strongly convex on $Q$ w.r.t. $\|\cdot\|_E$, i.e. for all $x,y\in Q$, $f(y) \ge f(x) + \langle\nabla f(x), y-x\rangle + \frac{\gamma}{2}\|x-y\|_E^2$.
To obtain the entropy-regularized optimal transport problem, set
- $E = \mathbb{R}^{n^2}$, $H = \mathbb{R}^{2n}$, $\|\cdot\|_E = \|\cdot\|_1$, $Q = S_{n^2}(1)$;
- $f(x) = \langle C,X\rangle + \gamma\langle X,\ln X\rangle$;
- $\{x : Ax = c\} = \{X : X\mathbf{1}=a,\ X^T\mathbf{1}=b\}$.

Flexibility
$\min_{X\in U(a,b)}\langle C,X\rangle + \gamma R(X)$, $U(a,b) := \{X\in\mathbb{R}^{n\times n}_+ : X\mathbf{1}=a,\ X^T\mathbf{1}=b\}$.
- Entropy regularization: $f(X) = \langle C,X\rangle + \gamma\langle X,\ln X\rangle$ is strongly convex w.r.t. $\|\cdot\|_1$ on $S_{n^2}(1)$.
- Squared Euclidean norm: $f(X) = \langle C,X\rangle + \gamma\|X\|_2^2$ is strongly convex w.r.t. the Euclidean norm.

Dual problem
$\min_{x\in Q}\{f(x) : Ax = c\} = \min_{x\in Q}\left\{f(x) + \max_{\lambda\in H}\langle\lambda, Ax - c\rangle\right\} = \max_{\lambda\in H}\left\{-\langle\lambda, c\rangle + \min_{x\in Q}\{f(x) + \langle\lambda, Ax\rangle\}\right\}$.
Dual problem:
$\min_{\lambda\in H}\left\{\varphi(\lambda) := \langle\lambda, c\rangle + \max_{x\in Q}\{-f(x) - \langle\lambda, Ax\rangle\}\right\}$.
$\nabla\varphi(\lambda) = c - Ax(\lambda)$, $x(\lambda) := \arg\max_{x\in Q}\{-f(x) - \langle\lambda, Ax\rangle\}$.
NB: $\nabla\varphi(\lambda)$ is Lipschitz-continuous:
$\varphi(\lambda) \le \varphi(\zeta) + \langle\nabla\varphi(\zeta), \lambda-\zeta\rangle + \frac{\|A\|_{E\to H}^2}{2\gamma}\|\lambda-\zeta\|_H^2$.

Particular example
$\min_{X\in U(a,b)}\langle C,X\rangle + \gamma\langle X,\ln X\rangle$,
$U(a,b) = \{X\in\mathbb{R}^{n\times n}_+ : X\mathbf{1}=a,\ X^T\mathbf{1}=b\} = \{X\in S_{n^2}(1) : X\mathbf{1}=a,\ X^T\mathbf{1}=b\}$.
Dual problem:
$\max_{\xi,\eta}\left\{-\gamma\ln\sum_{i,j=1}^n\exp\left(\frac{-C_{ij}+\xi_i+\eta_j}{\gamma}\right) + \langle\xi,a\rangle + \langle\eta,b\rangle\right\}$.
Cf. the dual of the non-regularized problem:
$\max_{\xi,\eta}\{\langle\xi,a\rangle + \langle\eta,b\rangle : C_{ij}-\xi_i-\eta_j \ge 0\} = \max_{\xi,\eta}\left\{\langle\xi,a\rangle + \langle\eta,b\rangle + \min_{i,j}\{C_{ij}-\xi_i-\eta_j\}\right\}$.

Main assumptions
$\min_{x\in Q\subseteq E}\{f(x) : Ax = c\}$.
- Function $f$ is $\gamma$-strongly convex.
- The problem $\max_{x\in Q}\{-f(x) - \langle A^T\lambda, x\rangle\}$ is simple: it has a closed-form solution or can be solved very fast, up to machine precision.
- The dual problem has a solution $\lambda^*$ and there exists some $R > 0$ such that $\|\lambda^*\|_2 \le R$.
NB: $R$ is used only in the convergence analysis, not in the algorithm itself.

Idea of adaptivity to the Lipschitz constant
$\varphi(\lambda) \le \varphi(\zeta) + \langle\nabla\varphi(\zeta), \lambda-\zeta\rangle + \frac{L}{2}\|\lambda-\zeta\|_H^2$.

Algorithm
Require: accuracy $\varepsilon_f, \varepsilon_{eq} > 0$, initial estimate $L_0$ s.t. $0 < L_0 < 2L$.
1: Set $i_0 = k = 0$, $M_{-1} = L_0$, $\beta_0 = \alpha_0 = 0$, $\eta_0 = \zeta_0 = \lambda_0 = 0$.
2: repeat {Main iterate}
3:   repeat {Line search}
4:     Set $M_k = 2^{i_k-1}M_{k-1}$; find $\alpha_{k+1}$ s.t. $\beta_{k+1} := \beta_k + \alpha_{k+1} = M_k\alpha_{k+1}^2$; set $\tau_k = \alpha_{k+1}/\beta_{k+1}$.
5:     $\lambda_{k+1} = \tau_k\zeta_k + (1-\tau_k)\eta_k$.
6:     [Update momentum] $\zeta_{k+1} = \zeta_k - \alpha_{k+1}\nabla\varphi(\lambda_{k+1})$.
7:     [Gradient step] $\eta_{k+1} = \tau_k\zeta_{k+1} + (1-\tau_k)\eta_k = \lambda_{k+1} - \frac{1}{M_k}\nabla\varphi(\lambda_{k+1})$.
8:   until $\varphi(\eta_{k+1}) \le \varphi(\lambda_{k+1}) + \langle\nabla\varphi(\lambda_{k+1}), \eta_{k+1}-\lambda_{k+1}\rangle + \frac{M_k}{2}\|\eta_{k+1}-\lambda_{k+1}\|_2^2$.
9:   [Primal update] $\hat x_{k+1} = \tau_k x(\lambda_{k+1}) + (1-\tau_k)\hat x_k$.
10:  Set $i_{k+1} = 0$, $k = k+1$.
11: until $f(\hat x_{k+1}) + \varphi(\eta_{k+1}) \le \varepsilon_f$ and $\|A\hat x_{k+1} - c\|_2 \le \varepsilon_{eq}$.
Ensure: $\hat x_{k+1}$, $\eta_{k+1}$.

Convergence theorem [D., Gasnikov, Kroshnin, 2018]
Assume that the objective in the primal problem is $\gamma$-strongly convex and that the dual solution $\lambda^*$ satisfies $\|\lambda^*\|_2 \le R$. Then, for $k \ge 1$, the points $\hat x_k$, $\eta_k$ in our algorithm satisfy
$f(\hat x_k) - f^* \le f(\hat x_k) + \varphi(\eta_k) \le \frac{16\|A\|_{E\to H}^2 R^2}{\gamma k^2} = O\left(\frac{1}{k^2}\right)$,
$\|A\hat x_k - c\|_2 \le \frac{16\|A\|_{E\to H}^2 R}{\gamma k^2} = O\left(\frac{1}{k^2}\right)$,
$\|\hat x_k - x^*\|_E \le \frac{8\|A\|_{E\to H} R}{\gamma k} = O\left(\frac{1}{k}\right)$,
where $x^*$ and $f^*$ are respectively an optimal solution and the optimal value of the primal problem.

Complexity for the optimal transport problem
1. Given a target accuracy $\varepsilon > 0$, choose $\gamma = \frac{\varepsilon}{5\ln n}$ and apply our algorithm to the entropy-regularized OT problem $\min_{X\in U(r,c)}\langle C,X\rangle + \gamma\langle X,\ln X\rangle$ with the stopping criterion
$f(\hat X_k) + \varphi(\eta_k) \le \frac{\varepsilon}{10}$, $\|\hat X_k\mathbf{1} - r\|_1 + \|\hat X_k^T\mathbf{1} - c\|_1 \le \frac{\varepsilon}{10\|C\|_\infty}$.
Output: $\hat X_k$. NB: it can happen that $\hat X_k\notin U(r,c)$.
2. Round $\hat X_k$ to $\hat X\in U(r,c)$ by Algorithm 2 in [Altschuler et al., 2017].
Complexity theorem [D., Gasnikov, Kroshnin, 2018]:
the total number of arithmetic operations to obtain $\hat X$ s.t. $\langle C,\hat X\rangle - \min_{X\in U(r,c)}\langle C,X\rangle \le \varepsilon$ is
$O\left(\min\left\{\frac{n^{9/4}\sqrt{\|C\|_\infty R\ln n}}{\varepsilon},\ \frac{n^2\|C\|_\infty R\ln n}{\varepsilon^2}\right\}\right)$.

Comparison with existing results
- Sinkhorn's algorithm [Altschuler, Weed, Rigollet, 2017]: $O\left(\frac{n^2\|C\|_\infty^3\ln n}{\varepsilon^3}\right)$.
- Sinkhorn's algorithm [D., Gasnikov, Kroshnin, 2018]: $O\left(\frac{n^2\|C\|_\infty^2\ln n}{\varepsilon^2}\right)$.
- It can be shown [Guminov, 2019] that $R \lesssim \|C\|_\infty\sqrt{n}$. Then for Accelerated Gradient Descent [D., Gasnikov, Kroshnin, 2018]:
$O\left(\frac{n^{2.5}\|C\|_\infty\sqrt{\ln n}}{\varepsilon}\right)$.

Adaptive vs non-adaptive algorithm
Images: S. Omelchenko.

Comparison with Sinkhorn's algorithm
MNIST dataset, averaged over 10 randomly chosen images.

Comparison with Sinkhorn's algorithm 2
MNIST dataset, averaged over 5 randomly chosen and scaled images, $n\in[28^2 = 784,\ 224^2 = 50176]$, $\varepsilon = 0.1$.

Homework

Data
Histograms:
- Discrete-discrete case: generate two random vectors $a, b\in S_n(1)$.
- Quasi semi-discrete case: generate $a\in S_n(1)$ and an empirical counterpart of a 1D Gaussian distribution.
- Quasi-continuous case: generate two empirical counterparts of 1D Gaussian distributions. Case 1: equal variances, different expectations. Case 2: different variances, different expectations.
- MNIST dataset. Case 1: images of 1 and 3. Case 2: images of 1 and 7.
Cost:
- $C_{ij} = \|x_i - x_j\|_2$ (Wasserstein-1);
- $C_{ij} = \|x_i - x_j\|_2^2$ (Wasserstein-2).
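A possible way to generate the synthetic part of this data (a sketch under my own choices of $n$, grid, and sample size; not the official homework code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Discrete-discrete case: two random vectors on the simplex S_n(1).
a = rng.random(n); a /= a.sum()
b = rng.random(n); b /= b.sum()

# Quasi-continuous case: empirical counterpart of a 1D Gaussian,
# realized as a normalized histogram of samples on a fixed grid.
grid = np.linspace(-5.0, 5.0, n)
samples = rng.normal(0.0, 1.0, size=10_000)
hist, _ = np.histogram(samples, bins=n, range=(-5.0, 5.0))
mu = hist / hist.sum()

# Ground cost matrices on the grid for the two choices of cost.
C1 = np.abs(grid[:, None] - grid[None, :])   # Wasserstein-1
C2 = C1 ** 2                                 # Wasserstein-2
```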

Solvers
- Standard LP solver (https://www.cvxpy.org/), interior point methods, $O(n^3)$ complexity; also allows constructing the dual variables.
  - Apply directly to the linear program.
  - Apply to the entropy-regularized problem.
  - Apply to the squared-Euclidean-norm regularized problem.
- Sinkhorn's method: $K = \exp(-C/\gamma)$, $\tilde u^{(k+1)} = a \oslash K\tilde v^{(k)}$, $\tilde v^{(k+1)} = b \oslash K^T\tilde u^{(k+1)}$; $\mathrm{diag}(\tilde u)\,K\,\mathrm{diag}(\tilde v)$ – approximation of the transport plan.
Exercise: how to find the approximation of the distance?

Output
- Transport plan.
- Dual variables.
- Distance.

4 OT barycenters

Barycenter

One extra level
- $E$ – basis space;
- $C(x,y)$ – transportation cost;
- $P(E)$ – space of measures on $E$;
- $P(P(E))$ – space of measures on $P(E)$.

Wasserstein barycenter
$\hat\nu = \arg\min_{\nu\in P_2(\Omega)}\frac{1}{m}\sum_{i=1}^m W_p^p(\mu_i,\nu)$,
where $W_p(\mu,\nu)$ is the Wasserstein distance between measures $\mu$ and $\nu$ on $\Omega$.
The WB is efficient in machine learning problems with geometric data, e.g. template image reconstruction from a random sample.
Images from [Cuturi & Doucet, 2014].

Template estimation from admissible deformations
Assume:
- $T$ – bounded random admissible transformation with finite moment $\mathbb{E}T$;
- $T_i$, $i=1,\dots,m$ – random sample of realizations of $T$;
- $\mu_i = (T_i)_\#\mu$ – random sample of measures, where $\mu$ is compactly supported;
- $\hat\nu = \arg\min_{\nu\in P_2(\Omega)}\frac{1}{m}\sum_{i=1}^m W_2^2(\mu_i,\nu)$.
Then:
- if $\mathbb{E}T = \mathrm{Id}$, then $W_2^2(\hat\nu,\mu)\to 0$ a.s. as $m\to\infty$;
- if $\|T - \mathrm{Id}\|_{L^2} \le M$ a.s., then $\mathbb{P}(W_2(\hat\nu,\mu)\ge\varepsilon) \le 2\exp\left(-\frac{m\varepsilon^2}{M^2(1 + c\varepsilon/M)}\right)$.
Boissard, Le Gouic, Loubes. Distribution's template estimate with Wasserstein metrics. 2015.

Wasserstein barycenters
$p$-Kantorovich-Wasserstein distance:
$(W_p(a,b))^p = \min_{X\in U(a,b)}\langle C,X\rangle$, $U(a,b) = \{X\in\mathbb{R}^{n\times n}_+ : X\mathbf{1}=a,\ X^T\mathbf{1}=b\}$.
Given a set of $m$ measures $b_i\in S_n(1)$, $i=1,\dots,m$, their Wasserstein barycenter is
$\hat a = \arg\min_{a\in S_n(1)}\frac{1}{m}\sum_{i=1}^m(W_p(a,b_i))^p = \arg\min_{a\in S_n(1)}\frac{1}{m}\sum_{i=1}^m\min_{X_i\in U(a,b_i)}\langle C,X_i\rangle$
$= \arg\min_{\substack{a\in S_n(1),\ X_i\in\mathbb{R}^{n\times n}_+\\ X_i\mathbf{1}=a,\ X_i^T\mathbf{1}=b_i}}\frac{1}{m}\sum_{i=1}^m\langle C,X_i\rangle$.
A large-scale linear program of dimension $mn^2 + n$. For simplicity, we omit $p$ below.
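For tiny $n$ this linear program can be assembled and solved directly. A minimal sketch (my own code; the variables are the $m$ plans followed by the barycenter weights $a$), reusing `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def barycenter_lp(C, bs):
    """Fixed-support barycenter as the LP above (equal weights, tiny n)."""
    m, n = len(bs), C.shape[0]
    nv = m * n * n + n                          # m plans (row-major) + a
    cost = np.concatenate([np.tile(C.ravel(), m) / m, np.zeros(n)])
    A_eq, b_eq = [], []
    for i in range(m):
        off = i * n * n
        for r in range(n):                      # X_i 1 = a (row sums)
            row = np.zeros(nv)
            row[off + r * n: off + (r + 1) * n] = 1.0
            row[m * n * n + r] = -1.0
            A_eq.append(row); b_eq.append(0.0)
        for c in range(n):                      # X_i^T 1 = b_i (column sums)
            row = np.zeros(nv)
            row[off + c: off + n * n: n] = 1.0
            A_eq.append(row); b_eq.append(bs[i][c])
    row = np.zeros(nv); row[m * n * n:] = 1.0   # a on the simplex S_n(1)
    A_eq.append(row); b_eq.append(1.0)
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.x[m * n * n:]

# Grid {0, 1, 2} with squared-distance cost; two Diracs at the endpoints.
grid = np.arange(3.0)
C = (grid[:, None] - grid[None, :]) ** 2
bs = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])]
a_hat = barycenter_lp(C, bs)
print(a_hat)  # mass concentrates at the middle point: (0, 1, 0)
```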

Motivation
Main question: how much work is needed to find the barycenter $\hat a\in S_n(1)$ with accuracy $\varepsilon$:
$\frac{1}{m}\sum_{l=1}^m W(\hat a, b_l) - \min_{a\in S_n(1)}\frac{1}{m}\sum_{l=1}^m W(a, b_l) \le \varepsilon$?
Challenges:
- fine discrete approximation for $\nu$ and $\mu$ $\Rightarrow$ large $n$;
- large amount of data $\Rightarrow$ large $m$;
- data produced and stored distributedly (e.g. produced by a network of sensors);
- possibly continuous measures $\mu_i$.

Entropic regularization [Cuturi & Doucet, 2014]
$\gamma$-regularized Monge-Kantorovich (Wasserstein) distance (Sinkhorn's distance):
$W_\gamma(a,b) = \min_{X\in U(a,b)}\langle C,X\rangle + \gamma\langle X,\ln X\rangle$, $U(a,b) = \{X\in\mathbb{R}^{n\times n}_+ : X\mathbf{1}=a,\ X^T\mathbf{1}=b\}$.
Given a set of $m$ measures $b_i\in S_n(1)$, $i=1,\dots,m$, their regularized Wasserstein barycenter is
$\hat a_\gamma = \arg\min_{a\in S_n(1)}\frac{1}{m}\sum_{i=1}^m W_\gamma(a, b_i)$.
Cuturi, Doucet. Fast Computation of Wasserstein Barycenters. ICML 2014.

Smoothing by regularization
Cuturi, Peyré. A smoothed dual approach for variational Wasserstein problems, 2015.

Background
$\hat a_\gamma = \arg\min_{a\in S_n(1)}\frac{1}{m}\sum_{i=1}^m W_\gamma(a, b_i)$, $\gamma \ge 0$.
Algorithms for the barycenter:
- Sinkhorn + Gradient Descent [Cuturi, Doucet, ICML'14]
- Iterative Bregman Projections [Benamou et al., SIAM J Sci Comp'15]
- (Accelerated) Gradient Descent [Cuturi, Peyré, SIAM J Im Sci'16; Dvurechensky et al., NeurIPS'18; Uribe et al., CDC'18]
- Stochastic Gradient Descent [Staib et al., NeurIPS'17; Claici, Chen, Solomon, ICML'18]
The question of complexity was open. How to choose $\gamma$?

Sinkhorn + Gradient Descent [Cuturi, Doucet, ICML'14]
By duality,
$W_\gamma(a,b) = \max_{\xi,\eta}\left\{-\gamma\sum_{i,j=1}^n\exp\left(\frac{-C_{ij}+\xi_i+\eta_j}{\gamma}\right) + \langle\xi,a\rangle + \langle\eta,b\rangle\right\}$.
By the Demyanov-Danskin theorem, $\nabla_a W_\gamma(a,b) = \xi^*$, hence $\nabla_a\sum_{i=1}^m W_\gamma(a,b_i) = \sum_{i=1}^m\xi_i^*$.
Idea:
1. Given the current approximation $a_\gamma^{(k)}$, find $(\xi_i^*)^{(k)}$.
2. Make a gradient descent step:
$a_\gamma^{(k+1)} = a_\gamma^{(k)} - \frac{\alpha_k}{m}\nabla_a\sum_{i=1}^m W_\gamma(a_\gamma^{(k)}, b_i) = a_\gamma^{(k)} - \frac{\alpha_k}{m}\sum_{i=1}^m(\xi_i^*)^{(k)}$.
Cuturi, Doucet. Fast Computation of Wasserstein Barycenters. ICML 2014.

Iterative Bregman Projections

Iterative Bregman Projections [Benamou et al., 2015]
$\min_{a\in S_n(1)}\frac{1}{m}\sum_{i=1}^m W_\gamma(a,b_i) = \min_{a\in S_n(1)}\min_{X_i\in U(a,b_i)}\frac{1}{m}\sum_{i=1}^m\left\{\langle C, X_i\rangle + \gamma\langle X_i, \ln X_i\rangle\right\}$
$= \min_{\substack{X_i\in\mathbb{R}^{n\times n}_+,\ X_i^T\mathbf{1}=b_i\\ X_i\mathbf{1} = X_{i+1}\mathbf{1},\ i=1,\dots,m}}\frac{1}{m}\sum_{i=1}^m\left\{\langle X_i, C\rangle + \gamma\langle X_i, \ln X_i\rangle\right\} = \min_{X\in C_1\cap C_2}\frac{\gamma}{m}\sum_{i=1}^m KL\left(X_i, e^{-C/\gamma}\right)$,
where $C_1 = \{X=[X_1,\dots,X_m] : \forall i\ X_i^T\mathbf{1}=b_i\}$, $C_2 = \{X=[X_1,\dots,X_m] : \exists a\in S_n(1)\ \forall i\ X_i\mathbf{1}=a\}$.
Benamou, Carlier, Cuturi, Nenna, Peyré. Iterative Bregman Projections for Regularized Transportation Problems, 2015.

Iterative Bregman Projections [Benamou et al., 2015]
$\min_{X\in C_1\cap C_2}\frac{\gamma}{m}\sum_{i=1}^m KL\left(X_i, e^{-C/\gamma}\right)$,
where $C_1 = \{X=[X_1,\dots,X_m] : \forall i\ X_i^T\mathbf{1}=b_i\}$, $C_2 = \{X=[X_1,\dots,X_m] : \exists a\in S_n(1)\ \forall i\ X_i\mathbf{1}=a\}$.
Alternating minimization (Bregman projections): $X^{(0)} = [e^{-C/\gamma},\dots,e^{-C/\gamma}]$,
$X^{(k+1)} = \arg\min_{X\in C_1}\sum_{i=1}^m KL\left(X_i, X_i^{(k)}\right)$, $X^{(k+2)} = \arg\min_{X\in C_2}\sum_{i=1}^m KL\left(X_i, X_i^{(k+1)}\right)$.
Benamou, Carlier, Cuturi, Nenna, Peyré. Iterative Bregman Projections for Regularized Transportation Problems, 2015.

Iterative Bregman Projections: dual view
$\min_{\substack{X_i\in\mathbb{R}^{n\times n}_+,\ X_i^T\mathbf{1}=b_i\\ X_i\mathbf{1}=X_{i+1}\mathbf{1},\ i=1,\dots,m}}\frac{1}{m}\sum_{i=1}^m\left\{\langle X_i, C\rangle + \gamma\langle X_i, \ln X_i\rangle\right\}$.
Dual problem:
$\min_{u,v:\ \sum_{l=1}^m v_l = 0}\left\{f(u,v) := \frac{1}{m}\sum_{l=1}^m\left[\langle\mathbf{1}, B_l(u_l,v_l)\mathbf{1}\rangle - \langle u_l, b_l\rangle\right]\right\}$,
$u = [u_1,\dots,u_m]$, $v = [v_1,\dots,v_m]$, $u_l, v_l\in\mathbb{R}^n$,
$B_l(u_l,v_l) := \mathrm{diag}\left(e^{u_l}\right)K\,\mathrm{diag}\left(e^{v_l}\right)$, $K = \exp(-C/\gamma)$.
IBP is equivalent to alternating minimization for the dual problem:
$u_l^{t+1} = \ln b_l - \ln\left(K e^{v_l^t}\right)$, $v^{t+1} = v^t$;
$v_l^{t+1} = \frac{1}{m}\sum_{k=1}^m\ln\left(K^T e^{u_k^{t+1}}\right) - \ln\left(K^T e^{u_l^{t+1}}\right)$, $u^{t+1} = u^t$;
$\hat a = \frac{\sum_{l=1}^m B_l(u_l,v_l)^T\mathbf{1}}{\sum_{l=1}^m\langle\mathbf{1}, B_l(u_l,v_l)\mathbf{1}\rangle}$.
Kroshnin, Tupitsa, Dvinskikh, D., Gasnikov, Uribe. On the complexity of approximating Wasserstein barycenters, ICML 2019.
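The alternating dual updates can be sketched in NumPy in scaling form (my own code and variable names; equal weights, and the common marginal is obtained as a geometric mean):

```python
import numpy as np

def ibp_barycenter(C, bs, gamma, iters=2000):
    """IBP sketch for the regularized barycenter with equal weights."""
    m, n = len(bs), C.shape[0]
    K = np.exp(-C / gamma)
    v = np.ones((m, n))
    for _ in range(iters):
        Kv = (K @ v.T).T
        u = bs / Kv                              # enforce B_l 1 = b_l
        KTu = (K.T @ u.T).T
        a = np.exp(np.mean(np.log(v * KTu), axis=0))   # geometric mean
        v = a / KTu                              # enforce B_l^T 1 = a
    return a

# Two spike histograms at the ends of a 5-point grid, squared cost:
grid = np.arange(5.0)
C = (grid[:, None] - grid[None, :]) ** 2
bs = np.array([[1.0, 0, 0, 0, 0], [0, 0, 0, 0, 1.0]])
a = ibp_barycenter(C, bs, gamma=0.1)
print(np.argmax(a))  # mass concentrates at the middle of the grid: index 2
```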

IBP complexity
- To find an $\varepsilon$-approximation of the $\gamma$-regularized WB, Iterative Bregman Projections (IBP) needs $O\left(\frac{1}{\gamma\varepsilon}\right)$ iterations (cf. $\frac{1}{\gamma\varepsilon}$ for Sinkhorn's algorithm).
- Setting $\gamma = \Theta(\varepsilon/\ln n)$ allows finding an $\varepsilon$-approximation of the non-regularized WB with arithmetic-operations complexity $\frac{mn^2}{\varepsilon^2}$ (cf. $\frac{n^2}{\varepsilon^2}$ for Sinkhorn's algorithm).
- IBP can be implemented distributedly in a centralized architecture (master/slaves).
Kroshnin, Tupitsa, Dvinskikh, D., Gasnikov, Uribe. On the complexity of approximating Wasserstein barycenters, ICML 2019.

Accelerated gradient method

Dual reformulation
$\min_{a\in S_n(1)}\frac{1}{m}\sum_{i=1}^m W_\gamma(a, b_i) = \min_{\substack{a_1,\dots,a_m\in S_n(1)\\ a_1 = \dots = a_m}}\frac{1}{m}\sum_{i=1}^m W_\gamma(a_i, b_i)$.
