Achieving High Utilization With Software-Driven WAN - SIGCOMM

Transcription

Achieving High Utilization with Software-Driven WAN. Chi-Yao Hong (UIUC), Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, Roger Wattenhofer (Microsoft). August 13, 2013.

Background: Inter-DC WAN. [Map of data-center sites, e.g., Los Angeles, Miami.] Inter-DC WANs are critical. Inter-DC WANs are highly expensive.

Two key problems. Poor efficiency: the average utilization over time of busy links is only 30-50%. Poor sharing: little support for flexible resource sharing. Why?

One cause of inefficiency: lack of coordination. [Chart: normalized traffic rate over one day for background and non-background traffic; adapting the background rate yields roughly a 50% reduction between the peak before rate adaptation and the peak after rate adaptation.]

Another cause of inefficiency: local, greedy resource allocation. MPLS TE (Multiprotocol Label Switching Traffic Engineering) greedily selects the shortest path that fulfills the capacity constraint.

Local, greedy resource allocation hurts efficiency. [Example: a 7-node topology where each link can carry at most one flow; flows A (1 to 6), B (3 to 6), and C (4 to 6) arrive in the order A, B, C. MPLS-TE places flows greedily in arrival order and ends up with less capacity for later flows than the optimal, globally computed allocation.]

Poor sharing. When services compete today, they can get higher throughput by sending faster. Mapping services onto different queues at switches helps, but # services (hundreds) far exceeds # queues (4-8 typically). Borrowing the idea of edge rate limiting, we can have better sharing without many queues.

Our solution: SWAN (Software-driven WAN), providing high utilization and flexible sharing.

System flow. Hosts report their traffic demand to the SWAN controller, and WAN switches report topology and traffic. The controller runs a global optimization for high utilization, then pushes rate limits to the hosts and forwarding-plane updates to the WAN switches.

Challenges: scalable allocation computation; congestion-free data plane update; working with limited switch memory.

Challenge #1: How to compute allocation in a time-efficient manner?

Computing resource allocation. A path-constrained, multi-commodity flow problem: allocate higher-priority traffic first and ensure weighted max-min fairness within a class. Solve at the granularity of {DC pair, priority class} tuples, then split each tuple's allocation fairly among its service flows.
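The last step, splitting a {DC pair, priority class} allocation fairly among the service flows in that tuple, can be done with standard weighted water-filling. A minimal sketch under that assumption (the function name, weights, and example numbers are illustrative, not SWAN's actual code):

```python
def waterfill(total, demands, weights=None):
    """Split `total` capacity across flows with the given demands so the result
    is weighted max-min fair: no flow can get more without taking bandwidth
    from a flow with a smaller weighted share."""
    weights = weights or [1.0] * len(demands)
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = total
    while active and remaining > 1e-9:
        # Fair share per unit of weight among the flows that still want more.
        share = remaining / sum(weights[i] for i in active)
        done = set()
        for i in active:
            give = min(demands[i] - alloc[i], share * weights[i])
            alloc[i] += give
            remaining -= give
            if alloc[i] >= demands[i] - 1e-9:
                done.add(i)        # demand satisfied; stop filling this flow
        if not done:
            break                  # every active flow already got its fair share
        active -= done
    return alloc

# Example: a 10 Gbps {DC pair, priority} allocation split across three flows.
print(waterfill(10.0, demands=[2.0, 8.0, 8.0]))   # ~ [2.0, 4.0, 4.0]
```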

But computing max-min fairness is hard. The state of the art takes minutes at our target scale, as it needs to solve a long sequence of LPs: # LPs = O(# saturated edges) [Danna, Mandal, Singh; INFOCOM'12].

Approximated max-min fairness. Repeatedly invoke an MCF solver that maximizes throughput and prefers shorter paths, sweeping the per-flow upper and lower rate bounds upward in powers of alpha (from alpha^0 toward the max demand) and freezing the rates of saturated flows after each invocation.
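A minimal sketch of that iteration, following the talk's approach (alpha = 2) but with a toy single-bottleneck allocator standing in for the path-constrained MCF solver; the BASE rate unit, the allocator, and the example demands are illustrative assumptions:

```python
ALPHA = 2.0          # geometric step between iterations (the talk uses alpha = 2)
BASE = 0.5           # rate unit for the first iteration (illustrative assumption)

def toy_allocator(capacity, frozen, bounds):
    """Stand-in for SWAN's path-constrained MCF solver, specialized to a single
    bottleneck link: give each unfrozen flow as much as possible within its
    [lo, hi] bounds, splitting the leftover capacity evenly."""
    rates = dict(frozen)
    free = capacity - sum(frozen.values())
    unfrozen = sorted(bounds, key=lambda f: bounds[f][1])   # smallest cap first
    for idx, f in enumerate(unfrozen):
        lo, hi = bounds[f]
        rates[f] = min(hi, max(lo, free / (len(unfrozen) - idx)))
        free -= rates[f]
    return rates

def approx_max_min(capacity, demands):
    """Approximate max-min fairness as in the talk: sweep per-flow rate bounds
    upward in powers of ALPHA and freeze flows that cannot reach their cap."""
    frozen, lo, hi = {}, 0.0, BASE
    while any(f not in frozen for f in demands):
        bounds = {f: (min(lo, demands[f]), min(hi, demands[f]))
                  for f in demands if f not in frozen}
        rates = toy_allocator(capacity, frozen, bounds)
        for f, (_, h) in bounds.items():
            if rates[f] < h - 1e-9 or h >= demands[f]:   # saturated or demand met
                frozen[f] = rates[f]
        lo, hi = hi, hi * ALPHA
    return frozen

print(approx_max_min(capacity=10.0, demands={"A": 2.0, "B": 8.0, "C": 8.0}))
# -> {'A': 2.0, 'B': 4.0, 'C': 4.0}
```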

Performance. Theoretical bound: the approximation is provably close to the exact max-min fair allocation. Empirical efficiency (with alpha = 2): only 4% of flows deviate over 5% from their fair-share rate, and computation time is sub-second.

Fairness: SWAN vs. MPLS TE. [Chart: flow goodput relative to the max-min fair rate versus flow index (in increasing order of demand), for SWAN with alpha = 2 and for MPLS TE.]

Challenge #2: Congestion-free update. How to update the forwarding plane without causing transient congestion?

Congestion-free update is hard. [Example: flows A and B must swap links between the initial state and the target state; moving either flow first can transiently overload a link.]

In fact, a congestion-free update sequence might not exist!

Idea: leave a small amount of scratch capacity on each link.

Slack = 1/3 of link capacity. Initial state: A carries 2/3 on one link, B carries 2/3 on the other; target state: the two flows swap links. With the slack, B first moves 1/3 of its traffic onto A's link, then A moves entirely to the other link, and finally B consolidates to 2/3, so every intermediate step keeps each link within capacity. Does slack guarantee that a congestion-free update always exists?
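A quick way to sanity-check such a plan is to verify, for every pair of consecutive configurations, that each link stays within capacity even if every tunnel briefly runs at the larger of its old and new rate (the worst case while switches apply an update asynchronously). A minimal checker encoding the slide's example; the flow names and two-link topology are just for illustration:

```python
# Each configuration maps (flow, link) -> rate, as a fraction of link capacity.
init   = {("A", "L1"): 2/3, ("B", "L2"): 2/3}
step1  = {("A", "L1"): 2/3, ("B", "L1"): 1/3, ("B", "L2"): 1/3}
step2  = {("A", "L2"): 2/3, ("B", "L1"): 1/3, ("B", "L2"): 1/3}
target = {("B", "L1"): 2/3, ("A", "L2"): 2/3}

def congestion_free(plan, links, capacity=1.0):
    """Check that every transition keeps every link within capacity, assuming a
    tunnel may transiently run at max(old rate, new rate) during the update."""
    for old, new in zip(plan, plan[1:]):
        for link in links:
            load = sum(max(old.get((f, l), 0.0), new.get((f, l), 0.0))
                       for (f, l) in set(old) | set(new) if l == link)
            if load > capacity + 1e-9:
                return False
    return True

print(congestion_free([init, step1, step2, target], links=["L1", "L2"]))  # True
print(congestion_free([init, target], links=["L1", "L2"]))  # False: one-shot overloads
```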

Yes! With slack s on every link, we prove there exists a congestion-free update in at most ceil(1/s) - 1 steps, where one step consists of multiple switch updates whose order can be arbitrary. It exists, but how do we find it?
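For concreteness, the step bound for the slack values that appear later in the backup slides is a one-line computation (the ceil(1/s) - 1 form is taken from the statement above):

```python
import math

for pct in (50, 30, 10):   # scratch capacity as a percentage of link capacity
    steps = math.ceil(100 / pct) - 1
    print(f"slack {pct}%: at most {steps} congestion-free step(s)")
# slack 50% -> 1, slack 30% -> 3, slack 10% -> 9 (cf. the backup-slide table)
```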

Congestion-free update: LP-based solution. Variables: the rate of each flow on each path at each step. Input: the initial and target configurations; output: the intermediate step configurations. Congestion-free constraint: the traffic on every link (i, j) never exceeds the link capacity c_{i,j}, including during the transitions between steps.
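As a rough illustration of such an LP (not SWAN's actual formulation), the sketch below uses the PuLP library to find one intermediate configuration for the two-flow example above. The max(old, new) load during a transition is linearized with auxiliary variables; the single intermediate step, flow names, and topology are assumptions made for the example:

```python
import pulp

links = ["L1", "L2"]
flows = {"A": 2/3, "B": 2/3}                      # total rate each flow must keep
init   = {("A", "L1"): 2/3, ("B", "L2"): 2/3}     # initial per-link rates
target = {("B", "L1"): 2/3, ("A", "L2"): 2/3}     # target per-link rates
cap = 1.0

prob = pulp.LpProblem("one_intermediate_step", pulp.LpMinimize)

# Variables: rate of each flow on each link in the intermediate configuration.
mid = {(f, l): pulp.LpVariable(f"mid_{f}_{l}", lowBound=0)
       for f in flows for l in links}
prob += pulp.lpSum(mid.values())   # dummy objective; we only need feasibility

# Each flow keeps its full rate in the intermediate configuration.
for f, demand in flows.items():
    prob += pulp.lpSum(mid[f, l] for l in links) == demand

# Congestion-free transitions (init -> mid and mid -> target): while switching,
# a tunnel may run at either its old or new rate, so bound per-tunnel maxima.
for tag, fixed_conf in [("in", init), ("out", target)]:
    for l in links:
        worst = []
        for f in flows:
            w = pulp.LpVariable(f"w_{tag}_{f}_{l}", lowBound=0)
            prob += w >= fixed_conf.get((f, l), 0.0)   # constant side of the max
            prob += w >= mid[f, l]                     # variable side of the max
            worst.append(w)
        prob += pulp.lpSum(worst) <= cap

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status])
print({k: pulp.value(v) for k, v in mid.items()})
# -> each flow splits 1/3 + 1/3 across the two links in the intermediate step
```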

Utilizing all the capacity. Non-background traffic is congestion-free and uses 90% of capacity (s = 10%); background traffic uses 100% of capacity (s = 0%) and has bounded congestion.

SWAN versus one-shot update. [Chart: CCDF over links & updates of overloaded traffic (MB); data-driven evaluation with s = 10% for non-background traffic.] One-shot update brings heavy packet drops. With SWAN, non-background traffic is congestion-free and background traffic fares much better than with one-shot updates.

Challenge #3: Working with limited switch memory.

Why does switch memory matter? How many rules do we need? With 50 sites (2,500 DC pairs), 3 priority classes, and static k-shortest path routing, a data-driven analysis shows that roughly 20K rules are required to fully use the network capacity. Commodity switches have limited memory: today's OpenFlow switches hold 1-4K rules; the next generation holds 16K rules [Broadcom Trident II].
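A back-of-the-envelope version of that estimate; the 3 tunnels per pair and class is an assumed illustrative value, and the ~20K figure in the talk comes from data-driven analysis rather than this arithmetic:

```python
sites = 50
dc_pairs = sites * (sites - 1)          # ordered site pairs, ~2,500
priority_classes = 3
tunnels_per_pair_class = 3              # assumed k for k-shortest path routing

tunnel_rules = dc_pairs * priority_classes * tunnels_per_pair_class
print(f"{dc_pairs} pairs x {priority_classes} classes x "
      f"{tunnels_per_pair_class} tunnels ~= {tunnel_rules} forwarding rules")
# ~22,000 rules, the same order of magnitude as the ~20K figure quoted above.
```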

Hardness: Finding the set of paths of a given size that carries the most traffic is NP-complete [Hartman et al., INFOCOM'12].

Heuristic: Dynamic path set adaptation. Observation: the working path set is much smaller than the total set of needed paths. Path selection: keep the important paths, i.e., those that carry more traffic and provide basic connectivity; this uses 10x fewer rules than static k-shortest path routing. Rule update: multi-stage rule update; with 10% memory slack, typically 2 stages are needed.
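A minimal sketch of such a greedy selection; the rule budget, the data shapes, and the "one path per pair for connectivity" tie-break are assumptions made for illustration:

```python
def select_paths(path_traffic, budget):
    """Pick a small working set of paths: one path per DC pair for basic
    connectivity, then spend the remaining budget on the paths carrying the
    most traffic (as reported by the latest unconstrained allocation)."""
    chosen = set()
    # Basic connectivity: the most-used path of every DC pair.
    pairs = {pair for (pair, _path) in path_traffic}
    for pair in pairs:
        best = max((p for p in path_traffic if p[0] == pair),
                   key=lambda p: path_traffic[p])
        chosen.add(best)
    # Fill the rest of the rule budget with the heaviest remaining paths.
    for p in sorted(path_traffic, key=path_traffic.get, reverse=True):
        if len(chosen) >= budget:
            break
        chosen.add(p)
    return chosen

# Traffic carried per (DC pair, path) in the last allocation, in Gbps.
traffic = {("NY-LA", "p1"): 9.0, ("NY-LA", "p2"): 0.2,
           ("NY-MIA", "p1"): 4.0, ("NY-MIA", "p2"): 3.5, ("NY-MIA", "p3"): 0.1}
print(select_paths(traffic, budget=3))
```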

Overall workflow. Compute the resource allocation; if the expected gain is enough, compute the rule update plan, compute the congestion-free update plan, notify services whose allocation decreases, update the network, and finally notify services whose allocation increases. If the gain is not enough, skip this round.

Evaluation platforms. Prototype: 5 DCs across 3 continents, 10 switches (Cisco N3K, Arista, blade servers). Data-driven evaluation: 40 DCs across 3 continents, 80 switches; G-scale: 12 DCs, 24 switches.

Prototype evaluation. [Chart: goodput (normalized & stacked) over roughly 6 minutes for the Background, Elastic, and Interactive classes against an (impractical) optimal line; dips are due to rate adaptation.] Traffic: 125 TCP flows per class per DC pair. High utilization: SWAN's goodput is 98% of an optimal method. Flexible sharing: interactive traffic is protected; background traffic is rate-adapted.

Data-driven evaluation of 40 DCs. [Chart: goodput versus optimal.] SWAN: 99.0%, near-optimal total goodput under a practical setting. MPLS TE: 58.3%, so SWAN carries about 60% more traffic than MPLS TE. SWAN without rate control: 70.3%, still carrying about 20% more traffic than MPLS TE.

SWAN: Software-driven WAN. High utilization and flexible sharing via global rate and route coordination. Scalable allocation via approximated max-min fairness. Congestion-free update in bounded stages. Using commodity switches with limited memory.

Conclusion. Achieving high utilization by itself is easy, but coupling it with flexible sharing and change management is hard. Approximating max-min fairness with low computational time. Keeping scratch capacity on links and in switch memory to enable quick transitions.

Thanks! Chi-Yao Hong (UIUC), Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, Roger Wattenhofer (Microsoft).

SWAN versus B4. High utilization: SWAN yes, B4 yes. Scalable rate allocation and congestion-free update: SWAN yes (in bounded steps), B4 no. Using commodity switches with a limited number of forwarding rules: SWAN yes, B4 no.

Demo Video: SWAN achieves high utilization.

Demo Video: SWAN provides flexible sharing.

Failure handling. Link and switch failures: a network agent notifies the SWAN controller, which performs a one-time global recomputation. Controller crashes: run stateless backup instances; during recovery, the data plane still operates. Controller bugs: work in progress.

Demo Video: SWAN handles link failures gracefully.

SWAN controller. Global allocation at the {SrcDC, DstDC, ServicePriority} level: support a few priorities (e.g., background, elastic, interactive) and map flows to priority queues at switches (via DSCP bits). Label-based forwarding ("tunnels") by tagging VLAN IDs: the SWAN controller globally computes how to split traffic at the ingress switches (e.g., 50% / 30% / 20% across tunnels from Src to Dst).
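One simple way an ingress could realize such a computed split is to hash each flow onto a tunnel in proportion to the weights. The sketch below is an illustrative assumption about the mechanism (real switches typically use group tables or weighted ECMP), not a description of SWAN's actual data plane:

```python
import hashlib

# Split computed by the controller for one {SrcDC, DstDC, priority} entry.
tunnel_weights = {"tunnel_1": 0.5, "tunnel_2": 0.3, "tunnel_3": 0.2}

def pick_tunnel(flow_key, weights):
    """Map a flow (e.g., its 5-tuple) to a tunnel so that, across many flows,
    traffic is divided roughly in proportion to the controller's weights."""
    h = int(hashlib.md5(flow_key.encode()).hexdigest(), 16) / float(2 ** 128)
    cumulative = 0.0
    for tunnel, w in weights.items():
        cumulative += w
        if h < cumulative:
            return tunnel
    return tunnel  # guard against rounding at the very top of the range

print(pick_tunnel("10.0.0.1:443->10.0.1.9:51622/tcp", tunnel_weights))
```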

Link-level fairness is not network-wide fairness. [Example topology with flows S1 to D1, S2 to D2, and S3 to D3, where per-link fair shares of 1/3 and 2/3 arise; link-by-link fairness does not compose into network-wide max-min fairness.]

Time for network update.

How much scratch capacity is needed? [Data-driven evaluation.]
slack s = 50%: max steps 1, 99th pctl. 1, goodput 79%
slack s = 30%: max steps 3, 99th pctl. 2, goodput 91%
slack s = 10%: max steps 9, 99th pctl. 3, goodput 100%
slack s = 0%: max steps unbounded, 99th pctl. 6, goodput 100%

Our heuristic: dynamic path selection. Solve the rate allocation; if the result fits in switch memory, finish. If not, greedily select the important paths and solve again with only the selected paths. This uses 10x fewer paths than static k-shortest path routing.

Rule update with memory constraints. Option #1: fully utilize all the switch memory? Then a rule update may disrupt traffic, since there is no room to install new rules before removing old ones.

Option #2: leave 50% slack [Reitblatt et al.; SIGCOMM'12]? That wastes half of the switch memory.

Multi-stage rule update. [Animation: active rules are swapped in and out of switch memory over several stages.] The bound on the number of stages is a function of the memory slack. When the slack is 10%, 2 stages suffice 95% of the time.
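A simplified model of how the number of stages falls out of the memory slack; this is an illustrative sketch, not SWAN's actual planner (which coordinates rule changes with the congestion-free traffic plan), and the churn value is an assumption:

```python
import math

def stages_needed(rules_in_use, changed_rules, memory):
    """Stages required under a simple model: each stage installs new rules into
    the free memory, shifts traffic onto them, and only then deletes the old
    rules they replace, so memory usage never exceeds `memory`."""
    free = memory - rules_in_use
    if free <= 0:
        raise ValueError("no memory slack; a safe in-place swap is impossible")
    return math.ceil(changed_rules / free)

memory = 1000       # rule capacity of the switch
in_use = 900        # 90% full, i.e. 10% memory slack
changed = 150       # rules whose path changed in this update (assumed churn)
print(stages_needed(in_use, changed, memory))   # -> 2 stages in this model
```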
