Terra: Scalable Cross-Layer GDA Optimizations

Transcription

Jie You, University of Michigan, Ann Arbor, Michigan, jieyou@umich.edu
Mosharaf Chowdhury, University of Michigan, Ann Arbor, Michigan, mosharaf@umich.edu
arXiv:1904.08480v1 [cs.DC] 17 Apr 2019

Abstract

Geo-distributed analytics (GDA) frameworks transfer large datasets over the wide-area network (WAN). Yet existing frameworks often ignore the WAN topology. This disconnect between WAN-bound applications and the WAN itself results in missed opportunities for cross-layer optimizations.

In this paper, we present Terra to bridge this gap. Instead of decoupled WAN routing and GDA transfer scheduling, Terra applies scalable cross-layer optimizations to minimize WAN transfer times for GDA jobs. We present a two-pronged approach: (i) a scalable algorithm for joint routing and scheduling to make fast decisions; and (ii) a scalable, overlay-based enforcement mechanism that avoids expensive switch rule updates in the WAN.
Together, they enable Terra to quickly react to WAN uncertainties such as large bandwidth fluctuations and failures in an application-aware manner as well. Integration with the FloodLight SDN controller and Apache YARN, and evaluation on 4 workloads and 3 WAN topologies, show that Terra improves the average completion times of GDA jobs by 1.55x-3.43x. GDA jobs running with Terra meet 2.82x-4.29x more deadlines and can quickly react to WAN-level events in an application-aware manner.

1 Introduction

To cope with the increasing number of Internet users and edge devices [11], large organizations leverage tens to hundreds of datacenters and edge sites [1, 10, 12, 28] to gather data related to end-user sessions and their devices, as well as monitoring logs and performance counters. Analyzing and personalizing this data can provide tremendous value in improving user experience [8, 65, 72]. Consequently, a growing body of recent work has focused on enabling geo-distributed analytics (GDA), i.e., executing computation tasks on the data stored in-place at different sites instead of copying it to a single datacenter. Faster completions of these jobs can enable log processing [65, 66], SQL query processing [71, 72], and machine learning [48] over large geo-distributed datasets.

Assuming static WAN bandwidth, existing solutions optimize query planning [49, 66, 71, 72], scheduling [50, 65], and algorithm design [48] to reduce inter-datacenter transfers over the WAN. This is because WAN bandwidth is expensive [54, 58] and often a major performance bottleneck for these communication-intensive jobs [48, 71, 72] (e.g., due to large intermediate data transfers [65]).

Unfortunately, existing GDA frameworks ignore the WAN topology and treat the WAN as a full mesh or a non-blocking switch [48, 50, 65, 66, 71, 72]. Although these simplifications – end-to-end tunnels and independent links, respectively – decouple GDA systems from WAN traffic engineering, they introduce a disconnect. Applications cannot optimize on the actual points of contention, which are hidden and constantly changing in the WAN; at the same time, WAN traffic engineering cannot optimize application-level objectives.

This mismatch between application- and WAN-level goals prolongs the communication stages of GDA jobs and increases their job completion times (JCT) (§2). Existing solutions that attempt to align the two [61, 83] do not scale to large WAN topologies or complex jobs, and they can themselves become the bottleneck (§3). The presence of WAN uncertainties such as large bandwidth fluctuations and link failures adds to the challenge because GDA jobs cannot rapidly adapt to changing WAN conditions.

Our goal in this paper is to speed up the communication stages of GDA jobs and to make them more robust to WAN uncertainties. To this end, we present Terra, a scalable framework that bridges the gap between GDA frameworks and the WAN by co-optimizing application-level transfer scheduling and WAN-level routing decisions. Terra's design is guided by two high-level observations:
1. Redundant paths in the WAN topology should be fully utilized in order to minimize GDA transfer times; and
2. SD-WAN rule updates are expensive and should be avoided whenever possible for fast decision enforcement.

We propose a two-pronged approach that can scale to large GDA jobs and WAN topologies while adhering to these observations. First, we propose a scalable algorithm to quickly compute multipath WAN routing and GDA scheduling decisions (§3). To this end, we generalize existing coflow-based solutions used inside single datacenters [30, 33, 83] to consider the WAN topology. Then we make it scalable by treating all flows between the same datacenter pair from the same coflow together instead of treating each one independently, reducing the problem size by many orders of magnitude. Second, we propose a multipath overlay on top of single-path TCP connections over the entire WAN and enforce our algorithm-determined routes, schedules, and rates on this overlay, limiting the need for WAN rule updates to (re)initialization only (§4). Our algorithm-systems co-design can address WAN uncertainties by quickly recomputing GDA transfer schedules and reconfiguring the WAN overlay.

[Figure 1. Opportunities for scheduling-routing co-optimization of two jobs running across three datacenters. (a) MapReduce job on the WAN. (b) WAN traffic of the 2 jobs: Job-1's coflow (dark/blue) has 1 flow and Job-2's (light/orange) has 2 flows. (c)-(f) Bandwidth allocations of the two bottleneck links (A->B and C->B) over time for 4 different scheduling-routing policies. Average completion times: (c) per-flow fair sharing, 14 seconds (per-coflow 8 s and 20 s); (d) multipath, 10.6 seconds (8.6 s and 12.6 s); (e) intra-datacenter coflow scheduling [33, 83], 12 seconds (4 s and 20 s); (f) Terra finds the optimal routing-scheduling joint solution: 7.15 seconds (2.9 s and 11.4 s). Panel (a) labels show 5 GB, 5 GB, and 10 GB of shuffle data and a 6 Gbps link.]

We have a full-stack implementation of the proposed solution (§5), integrated with the FloodLight [13] SDN controller on the network side and Apache YARN [3] on the application side. It provides a simple API to express WAN coflows. User-written jobs in a framework remain unmodified.

We evaluated Terra using three WAN topologies (Microsoft's SWAN [47], Google's G-Scale [53], and AT&T's MPLS topology [6]) and four different workloads (BigBench [7], TPC-DS [15], and TPC-H [16] with complex DAGs, and Facebook data transfer matrices from simple MapReduce jobs [9, 14]) (§6). For the small-scale testbed experiment using the SWAN topology, Terra improved the average job completion time (JCT) by 1.55x-3.43x on average and 2.12x-8.49x at the 95th percentile, while improving WAN utilization by 1.32x-1.76x. For large-scale simulations, Terra improved the average JCT by 1.04x-2.53x for the smallest topology (SWAN) and 1.52x-26.97x for the largest topology (AT&T) against baselines such as per-flow fairness, multipath routing, SWAN-MCF [47], Varys [33], and Rapier [83]. Terra can complete 2.82x-4.29x more coflows within their deadlines. We also show that it can react quickly to WAN events, that it scales well, and that its benefits hold across a wide range of parameter and environment settings.

In summary, our contributions in this paper are three-fold:
1. Identifying scalability bottlenecks in the GDA-WAN co-optimization problem from both algorithm and system design perspectives.
2. A scalable algorithm that co-optimizes complex GDA jobs with large WAN topologies to minimize their data transmission times.
3. A scalable system design and implementation that integrates with GDA frameworks and SD-WANs to enforce those decisions and provides large performance benefits.

2 Background and Motivation

This section provides a quick overview of GDA systems (§2.1), common WAN models used by them (§2.2), and the coflow abstraction (§2.3), followed by an illustration of the advantages of application-WAN co-optimization (§2.4).

2.1 Geo-Distributed Analytics (GDA)

GDA users submit jobs written in higher-level interfaces – e.g., in SparkSQL [26], Hive [4], or GraphX [41] – typically to one central job master that resides in one of the many distributed sites/datacenters [65, 72]. The master constructs an optimized execution plan [71] for the job and represents it as a directed acyclic graph (DAG). Nodes of this DAG represent computation stages with many parallel tasks, and edges represent stage dependencies as well as WAN transfers between tasks in different datacenters.
A centralized scheduler then places these tasks at machines across different datacenters based on data locality, resource availability, and their dependencies [50, 71, 72]. The durations of these jobs typically range from minutes to tens of minutes, and communication across the WAN is often the bottleneck [65, 71].

2.2 Inter-Datacenter WAN Model

Datacenters used by GDA frameworks are connected by a WAN. While such WANs have traditionally been optimized using MPLS-based [75] traffic engineering, centralized SD-WANs are becoming more popular [47, 53, 58]. We assume the latter, which enables Terra to make and enforce topology-aware routing decisions. Existing GDA systems assume either a full mesh with heterogeneous link capacities [48, 71] or a non-blocking switch with contention only at the uplinks or downlinks of each datacenter [65]. As such, they miss the opportunity to utilize redundant paths in the WAN.
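To make this link model concrete, here is a tiny, purely illustrative sketch (not Terra's data model): parallel physical links between two datacenters collapse into one logical link, and a link's usable bandwidth is its capacity minus the high-priority traffic that WAN managers prioritize, as discussed below. All names and numbers are assumptions.

    from collections import defaultdict

    def build_logical_wan(physical_links, reserved):
        """physical_links: iterable of (u, v, capacity_gbps);
        reserved: {(u, v): Gbps of prioritized traffic to exclude}."""
        logical = defaultdict(float)
        for u, v, cap in physical_links:
            logical[(u, v)] += cap      # cumulative bandwidth of parallel links
        # available bandwidth = capacity minus prioritized traffic
        return {e: cap - reserved.get(e, 0.0) for e, cap in logical.items()}

    wan = build_logical_wan(
        [("A", "B", 10.0), ("A", "B", 10.0), ("A", "C", 10.0), ("C", "B", 10.0)],
        reserved={("A", "B"): 4.0})
    print(wan)   # {('A', 'B'): 16.0, ('A', 'C'): 10.0, ('C', 'B'): 10.0}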

[Figure 2. Need for application-aware WAN re-optimization. (a) On Figure 1a's topology, Coflow-3 (dark/blue) has 1 flow and Coflow-4 (light/orange) has 2 flows; panel labels show volumes of 10 GB, 4 GB, and 6 GB. Average completion times: (b) the optimal solution, 8 seconds (8 s and 8 s); (c) after rerouting f42 due to the failure of the A-C link, 18 seconds (16 s and 20 s); (d) the new optimal solution after the failure, 14 seconds (8 s and 20 s).]

Existing solutions assume that the WAN topology and available bandwidth remain fixed during a job's execution. However, this may lead to performance issues when the WAN configuration is updated in the middle of a job's execution. For example, SWAN [47] updates WAN configurations every 5 minutes. Because Terra is integrated with the SD-WAN controller, unlike existing solutions, it can monitor and react to these changes. High-priority, user-facing, and deadline-sensitive traffic is prioritized by WAN managers [57, 58], including Terra. So we consider a link's bandwidth to be the remaining capacity excluding such traffic.

2.3 The Coflow Abstraction

GDA jobs typically use the same programming models (e.g., MapReduce) as traditional analytics jobs running within a single datacenter [65, 71, 72]. These programming models often have a communication stage between computation stages, where a computation stage cannot start until all flows in the preceding communication stage have finished. Recently, Chowdhury and Stoica defined such a collection of flows with a shared fate as a coflow [30], and many have shown that minimizing a coflow's completion time (CCT) can decrease a job's completion time [18, 33, 83]. The coflow abstraction generalizes intermediate data transfers (i.e., shuffles) for GDA jobs too [62].
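In code form, the abstraction is just this (a toy rendering, with flow finish times taken as given):

    class Coflow:
        """A coflow: flows with shared fate; CCT = slowest flow's finish time."""
        def __init__(self, flow_finish_times):
            self.flow_finish_times = flow_finish_times

        def cct(self):
            # The next computation stage starts only after every flow ends.
            return max(self.flow_finish_times)

    print(Coflow([8.0, 20.0]).cct())   # 20.0, e.g., Coflow-2 of Figure 1 in §2.4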
2.4 Potential Benefit of Co-Optimization

Setup  Without loss of generality, we consider a GDA job with one map stage and one reduce stage for ease of explanation. Now consider this job running on the WAN topology with 3 datacenters {A, B, C} shown in Figure 1a. Suppose a GDA query planning and task placement algorithm [49, 66, 71, 72] has put some map tasks in A, others in B, and all reduce tasks in B. Therefore, part of its shuffle traffic would be transferred over the WAN from A to B, while the other part would stay inside datacenter B. We focus on WAN traffic here because the limited bandwidth of the WAN becomes the bottleneck for GDA jobs [65, 71, 72]. Assuming the total volume of intermediate data generated by M1 at datacenter A to be 5 GB, we can now form a coflow (Coflow-1 in Figure 1b). Coflow-2 is similarly generated, but with a different communication pattern. Our goal now is to minimize the average completion time for both coflows.

Existing Solutions  First, let us consider the classic flow-level fair sharing that equally divides the A->B link between flows f11 and f21 (Figure 1c). Thus, both flows complete by 8 seconds, whereas f22 completes by 20 seconds facing no contention. Consequently, Coflow-1 and Coflow-2 complete in 8 and 20 seconds, respectively. The average completion time is 14 seconds.

A simple improvement would be using multiple paths (e.g., MPTCP [36]) to increase network utilization (Figure 1d). In this case, all the flows are split across available paths. Assuming an equal split and fair sharing on each link, the average completion time is 10.6 seconds.

Coflow-aware scheduling [31-33, 83] improves the average completion time by considering all the flows of the same coflow together. In this case, f11 will be scheduled before f21 on the A-B link (Figure 1e). Consequently, Coflow-1 finishes in 4 seconds and Coflow-2 in 20 seconds. The average completion time is 12 seconds. These coflow-based solutions still fall short, either by assuming a non-blocking topology [33] or by considering only a single path [83].

Co-Optimization Finds the Optimal Solution  So far we have shown 3 sub-optimal solutions that each optimize only one side of the application-WAN duo. However, if we consider both simultaneously and combine coflow scheduling with multipath routing, we can achieve the optimal average completion time of only 7.15 seconds (Figure 1f). Note that we considered only two jobs and a minimal full-mesh topology in the offline scenario. Terra performs even better with more jobs, on larger WAN topologies that are not full mesh, and in online scenarios (§6).

Re-Optimization is Necessary Under Uncertainties  Consider the same topology as in Figure 1a but with two different coflows (Figure 2a). Existing WAN-agnostic solutions [65, 71] will schedule Coflow-3 and Coflow-4 together to achieve the optimal average completion time of 8 seconds. However, if the link between A and C fails (or experiences a massive increase in high-priority traffic) right after the scheduling decision has been made, the WAN will reroute f42, and the completion times of Coflow-3 and Coflow-4 would become 16 and 20 seconds, respectively. Hence, the average completion time would be 18 seconds. The optimal solution is rescheduling Coflow-3 before Coflow-4 so that they complete in 8 and 20 seconds, for a new minimum average completion time of 14 seconds.
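The figure itself did not survive transcription, so the sanity check below assumes capacities consistent with the prose and the surviving figure labels: A-B at 10 Gbps, C-B at 4 Gbps, A-C at 6 Gbps, with f11 = f21 = 5 GB and f22 = 10 GB. Under those assumptions, the stated averages fall out directly:

    GBIT = 8.0                              # Gbit per GB

    # (c) Per-flow fair sharing: f11 and f21 split A->B; f22 owns C->B.
    t_f11 = 5 * GBIT / (10.0 / 2)           # 8 s
    t_f22 = 10 * GBIT / 4.0                 # 20 s
    print((t_f11 + max(t_f11, t_f22)) / 2)  # 14.0 s average

    # (e) Coflow scheduling on direct paths: f11 preempts f21 on A->B.
    t_c1 = 5 * GBIT / 10.0                  # Coflow-1 done at 4 s
    t_f21 = t_c1 + 5 * GBIT / 10.0          # f21 done at 8 s
    print((t_c1 + max(t_f21, t_f22)) / 2)   # 12.0 s average

    # (f) Joint scheduling + multipath: Coflow-1 uses A->B plus A->C->B
    # (bottlenecked by C->B), then Coflow-2 gets the same 14 Gbps into B.
    into_B = 10.0 + min(6.0, 4.0)           # 14 Gbps
    t_c1 = 5 * GBIT / into_B                # ~2.86 s
    t_c2 = t_c1 + (5 + 10) * GBIT / into_B  # ~11.43 s
    print(round((t_c1 + t_c2) / 2, 2))      # 7.14 s, the ~7.15 s of Figure 1f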

[Figure 3. Scheduling overhead of a state-of-the-art solution [83]: duration of a schedule (seconds), average and 95th percentile, on the SWAN, G-Scale, and ATT topologies; the bar for the largest topology is marked "Didn't finish after 10s".]

Table 1. Terra vs existing solutions. Rows: Per-Flow/TCP, Multipath/MPTCP, Datacenter Coflows, SD-WANs, GDA Systems, and Terra. Columns: WAN-Aware*, Leverages Multipath, App-Aware, and Re-Optimizes†. (The per-cell check marks did not survive transcription; the surrounding text indicates that only Terra satisfies all four criteria.)
*WAN-Aware: does not assume a full-mesh topology, non-blocking core, or symmetric paths.
†Re-Optimize: application-aware re-scheduling and rerouting of WAN transfers.

[Figure 4. Scaling down the number of flows in a coflow. (a) Coflow with 16n flows. (b) FlowGroups of this coflow.]

2.5 Summary

Table 1 compares the solutions discussed above across key criteria. The key takeaways from this section are:
1. The optimal average coflow completion time can only be achieved when jointly considering routing and scheduling;
2. In the presence of WAN uncertainties (e.g., bandwidth fluctuations and failures), application-level scheduling must react to WAN-level routing, and vice versa.

3 Terra: Algorithm Design

Given the benefits of co-optimization, the need for fast re-optimizations, and the scale and heterogeneity of the WAN, we must design a cross-layer solution that can perform well at WAN scale. This requires both a scalable algorithmic solution that can quickly make joint scheduling and routing decisions and a scalable system design that can quickly enforce those decisions throughout the WAN. In this section, we focus on the former. Section 4 discusses the latter.

3.1 Minimizing the Average Completion Time

Terra's primary goal is faster completion of WAN transfers from geo-distributed jobs, i.e., minimizing the average CCT. Given a coflow, Terra must decide when to start individual flows, at what rate to send them, and which path(s) each flow should take.

This problem is computationally intractable even when all coflows start together, their traffic matrices are known a priori, and the WAN is non-blocking. Because inter-coflow scheduling in a non-blocking datacenter is known to be NP-hard under these assumptions [33], the counterpart on a general WAN topology is NP-hard too. Given the intractability of the problem, we first focus on designing an efficient offline heuristic (§3.1.1-§3.1.2), then extend it to online scenarios (§3.1.3).

Consider a WAN graph G = (V, E), where V is the set of nodes – the datacenters – and E is the set of WAN links. We represent multiple physical links between u and v (u, v ∈ V) with one logical link e = (u, v) ∈ E with the cumulative bandwidth. At time T, e's available bandwidth is c_T(u, v). We consider the offline problem of scheduling-routing |C| GDA coflows C = {C_1, C_2, ..., C_|C|} that arrived at time T.

3.1.1 The Minimum CCT of a Single Coflow

We start by focusing on a single coflow and decide how to route its flows and at what rate to send the traffic.

Scalability Limitation  Calculating the rate and routing for every single flow is impractical, because the number of flows for even one coflow can be very large. Per our measurement, the computation time of a state-of-the-art solution that considers coflow routing [83] is 1.952 seconds on average for the BigBench workload on the SWAN topology. The computation overhead only increases for even larger topologies (Figure 3). Clearly, we cannot calculate rate allocation and routing for each flow of each coflow.

Per-Flow Rate Allocation is Unnecessary  Existing solutions show that the minimal coflow completion time can be achieved by enforcing per-flow rates that ensure all flows finish together [32, 33, 83]. However, we observe that we can still achieve the minimal coflow completion time even when individual flows do not finish together.

Consider a MapReduce job running on the same WAN as in Figure 1a. Assume there are 5n map tasks placed in B, 3n map tasks placed in C, and 2 reduce tasks placed in A (Figure 4a), for a total of 16n flows in this coflow. Suppose each flow needs to send 1 GB of data; enforcing that all flows finish together gives every flow 1/n Gbps of throughput. This allocation yields the minimal coflow completion time of 8n seconds, and both links B->A and C->A are fully utilized the whole time. Now, take all the flows traversing link B->A and change their rate allocation: schedule them one at a time in FIFO order, allocating the entire bandwidth of B->A (10 Gbps) to each of them in turn. We can still achieve the same CCT of 8n seconds. This gives us the following lemma.

Lemma 3.1. If multiple flows of the same coflow have the same (src datacenter, dst datacenter) pair, all work-conserving rate allocations of them will achieve the same completion time.
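A quick numeric check of the lemma, using the example above with n = 2 (so the 5n map tasks in B feed both reducers in A: 20 flows of 1 GB each over the 10 Gbps B->A link; the 10 Gbps figure comes from the text, the rest is illustrative). Any work-conserving policy keeps the bottleneck busy until the whole group finishes, so the group's completion time is always total volume divided by capacity:

    def finish_fair(volumes_gbit, capacity_gbps):
        # Equal sharing: every flow progresses at capacity / #flows.
        per_flow = capacity_gbps / len(volumes_gbit)
        return max(v / per_flow for v in volumes_gbit)

    def finish_fifo(volumes_gbit, capacity_gbps):
        # One at a time: each flow gets the entire link in turn.
        return sum(v / capacity_gbps for v in volumes_gbit)

    flows = [8.0] * 20                  # twenty 1 GB (= 8 Gbit) flows over B->A
    print(finish_fair(flows, 10.0))     # 16.0 s  (= 8n for n = 2)
    print(finish_fifo(flows, 10.0))     # 16.0 s  -- identical completion time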
Consequently, we can group flows within a coflow by their (src datacenter, dst datacenter) tuple. The rates of individual flows within such a group do not directly affect the coflow completion time, as long as the total rate of the group remains the same. This grouping is similar to that of FlowGroup [58]; for simplicity, we refer to such groups of flows as FlowGroups.
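A minimal sketch of the grouping (Step 1 below): coalesce a coflow's flows by their (source datacenter, destination datacenter) pair and keep only the total volume per pair. The flow list reuses the Figure 4a example with n = 2; everything here is illustrative, not Terra's API.

    from collections import defaultdict

    def to_flowgroups(flows):
        """flows: iterable of (src_dc, dst_dc, volume_gbit) for one coflow."""
        groups = defaultdict(float)
        for src, dst, vol in flows:
            groups[(src, dst)] += vol   # d_i(u, v): total volume per pair
        return dict(groups)

    # 16n flows (n = 2): 5n maps in B and 3n maps in C, each feeding the
    # 2 reducers in A, collapse into just 2 FlowGroups.
    flows = [("B", "A", 8.0)] * 20 + [("C", "A", 8.0)] * 12
    print(to_flowgroups(flows))         # {('B','A'): 160.0, ('C','A'): 96.0}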

The notion of FlowGroup brings performance improvements in both calculating and enforcing the rate allocation. Because we only need per-FlowGroup rate allocation, the scale of our problem formulation is reduced by orders of magnitude (O(|FlowGroups|/|Flows|)). For example, in Figure 4b, 16n flows become only 2 FlowGroups. This significantly lowers our scheduling overhead (§6.6).

Solution Approach  We can now formulate an optimization problem to minimize the CCT for a single coflow on a general topology. Previous works [83] assumed that a flow can only traverse a single path, leading to an Integer Linear Programming (ILP) formulation, which is computation-intensive. Because of Lemma 3.1, we can assume that a FlowGroup can be split across many paths, thereby eliminating all integral constraints and leading to an LP formulation. We organize our solution in two steps:
1. Scale down by coalescing flows into FlowGroups; and
2. Obtain fractional routes for FlowGroups while minimizing the CCT.

Step 1: Scaling Down Using FlowGroups  In this step, we collapse all flows from the same coflow with the same (src datacenter, dst datacenter) tuple into one FlowGroup. We can now represent a coflow C_i as a collection of FlowGroups D_i = [d_i(u, v)], where d_i(u, v) is the total volume of WAN transfers between the machines of C_i in datacenters u and v. \tilde{D}_i denotes the set of FlowGroups with non-zero volumes in D_i.

Step 2: Determining the CCT Lower-Bound  We now determine the paths and rates of individual FlowGroups to minimize the CCT. We denote the completion time of coflow C_i by \Gamma_i, defined as

    \Gamma_i = \max_{d_i^k \in \tilde{D}_i} T(d_i^k),

where d_i^k is the k-th FlowGroup of C_i and T(\cdot) is the completion time of a FlowGroup. Hence, the slowest FlowGroup d_i^k \in \tilde{D}_i determines \Gamma_i. Our objective is given as:

    \text{Minimize } \Gamma_i \qquad (1)

Let us represent the bandwidth allocation of the k-th FlowGroup in \tilde{D}_i (1 \le k \le |\tilde{D}_i|) with size d_i^k between nodes u and v by f^k(u, v), where u, v \in V. To minimize \Gamma_i, we generalize WSS [32] and MADD [33] to multiple paths to enforce an equal rate of progress for all FlowGroups. For each FlowGroup, we can then ensure that it makes 1/\Gamma_i progress every time unit by enforcing the following constraints:

    \sum_{w \in V} f^k(\mathrm{src}(d_i^k), w) = d_i^k / \Gamma_i
    \sum_{w \in V} f^k(w, \mathrm{dst}(d_i^k)) = d_i^k / \Gamma_i

The former ensures that the outgoing rate of a FlowGroup is proportional to its volume, while the latter enforces the same on the receiving end. Finally, we enforce the usual capacity and flow-conservation constraints:

    \sum_{v \in V} \big( f^k(u, v) - f^k(v, u) \big) = 0 \quad \forall u \neq \mathrm{src}(d_i^k), \mathrm{dst}(d_i^k)
    \sum_k f^k(u, v) \le c_T(u, v) \quad \forall (u, v) \in E
    f^k(u, v) \ge 0

Note that enforcing a 1/\Gamma_i rate for all FlowGroups leaves the maximum amount of bandwidth possible for other coflows that are scheduled after C_i without sacrificing C_i's CCT. Work conservation uses up any remaining bandwidth (§3.1.2). If Optimization (1) has a feasible solution for C_i, it creates a matrix f_i^k = [f^k(u, v)]_{|V| \times |V|} for each FlowGroup corresponding to its allocations on individual links of the WAN. Because f^k(u, v) can be non-integer, a FlowGroup can be subdivided across many paths from u to v. We enforce this using an overlay in our systems design (§4).
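To make the formulation concrete, here is a hedged sketch – not Terra's implementation – that solves Optimization (1) with scipy. It uses the standard maximum-concurrent-flow reformulation: minimizing \Gamma_i is equivalent to maximizing the progress rate lambda = 1/\Gamma_i, which keeps every constraint linear. The instance is the Figure 4 example (n = 2) with assumed capacities: B-A at 10 Gbps per the text, C-A at 6 Gbps inferred from the full-utilization claim, B-C at 4 Gbps from Figure 1. The expected answer is 8n = 16 seconds.

    import numpy as np
    from scipy.optimize import linprog

    def min_cct(nodes, cap, groups):
        """cap: {(u, v): Gbps per directed edge}; groups: [(src, dst, Gbit)].
        Maximizes lambda = 1/Gamma_i under per-group flow conservation and
        shared link capacities; returns the minimum CCT Gamma_i."""
        edges = list(cap)
        nE, nK = len(edges), len(groups)
        nvar = nK * nE + 1                    # f_k(e) variables, then lambda
        A_eq, b_eq = [], []
        for k, (src, dst, vol) in enumerate(groups):
            for node in nodes:                # net outflow per node and group
                row = np.zeros(nvar)
                for i, (u, v) in enumerate(edges):
                    if u == node: row[k * nE + i] += 1.0
                    if v == node: row[k * nE + i] -= 1.0
                if node == src:   row[-1] = -vol   # net out = lambda * volume
                elif node == dst: row[-1] = +vol   # net in  = lambda * volume
                A_eq.append(row); b_eq.append(0.0)
        A_ub = np.zeros((nE, nvar))           # sum_k f_k(e) <= c_T(e)
        for i in range(nE):
            A_ub[i, [k * nE + i for k in range(nK)]] = 1.0
        c = np.zeros(nvar); c[-1] = -1.0      # linprog minimizes, so -lambda
        res = linprog(c, A_ub=A_ub, b_ub=[cap[e] for e in edges],
                      A_eq=A_eq, b_eq=b_eq)   # all variables default to >= 0
        assert res.success
        return 1.0 / res.x[-1]

    cap = {("B", "A"): 10.0, ("C", "A"): 6.0, ("B", "C"): 4.0, ("C", "B"): 4.0}
    groups = [("B", "A", 160.0), ("C", "A", 96.0)]    # FlowGroups for n = 2
    print(round(min_cct(["A", "B", "C"], cap, groups), 2))   # 16.0 seconds

Because the FlowGroup rates are fractional, the solver is free to split a FlowGroup across detour paths (e.g., B->C->A) whenever the direct link saturates first; here the direct links are exactly saturated, so \Gamma_i matches the 8n-second bound.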
Pseudocode 1. Offline Scheduling-Routing

     1: procedure allocBandwidth(Coflows C, WAN G)
     2:     Scale down G by (1 - α)                 ▷ Starvation freedom
     3:     C_Failed ← ∅                            ▷ Coflows not scheduled in entirety
     4:     for all C_i ∈ C do
     5:         Γ_i, f_i^k ← Solve Optimization (1) for C_i on G
     6:         if Γ_i > 1 then
     7:             C_Failed ← C_Failed ∪ {C_i}
     8:             continue
     9:         if D_i ≠ ∞ then                     ▷ C_i has a deadline
    10:             Scale down f_i^k by Γ_i / D_i
    11:         P_i^k ← {end-to-end paths from f_i^k allocations}
    12:         G ← updated G by subtracting f_i^k allocations
    13:     C* ← ∪ C for all C ∈ C_Failed
    14:     Allocate C* on G using MCF              ▷ Work conservation
    15:     Allocate C \ C* on G using MCF
    16: end procedure
    17: procedure minimizeCCTOffline(Coflows C)
    18:     C′ ← Sort C by increasing Γ_i
    19:     allocBandwidth(C′, G)
    20: end procedure

3.1.2 Scheduling Multiple Coflows

We now move on to considering multiple coflows in the offline scenario. Given multiple coflows, scheduling one coflow can impact the CCTs of all other coflows scheduled afterward.

Consequently, a natural extension of the SRTF policy is sorting the coflows by their Γ values and scheduling them in that order (minimizeCCTOffline in Pseudocode 1). This requires solving O(N) instances of Optimization (1) during each scheduling round, which is activated by a coflow's arrival, completion, and WAN events. We schedule a coflow if all of its FlowGroups can be scheduled simultaneously.

Work Conservation  If the WAN is still not fully utilized after scheduling all coflows that can be scheduled in their entirety, we run a max-min multi-commodity flow (MCF) formulation similar to [47] on a combination of coflows (prioritizing C_Failed) to ensure work conservation and maximize WAN utilization (lines 14-15 in Pseudocode 1).
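As a deliberately simplified, self-contained illustration of this driver (the minimizeCCTOffline procedure of Pseudocode 1), the sketch below sorts coflows by an estimated Γ and subtracts each scheduled coflow's rates from the residual WAN. To stay short, it estimates Γ on direct links only instead of solving Optimization (1), and coflows that cannot be scheduled in their entirety land in a failed set, standing in for C_Failed (which the real algorithm then serves with the work-conserving MCF). All names and numbers are illustrative.

    def gamma_direct(flowgroups, residual):
        # Direct-link estimate of the coflow's CCT; inf if some needed
        # link has no residual capacity (cannot be scheduled in entirety).
        if any(residual[e] <= 1e-9 for e in flowgroups):
            return float("inf")
        return max(vol / residual[e] for e, vol in flowgroups.items())

    def minimize_cct_offline(coflows, residual):
        """coflows: {name: {(u, v): Gbit}}; residual: {(u, v): Gbps} (mutated)."""
        order = sorted(coflows, key=lambda n: gamma_direct(coflows[n], residual))
        schedule, failed = {}, []
        for name in order:
            g = gamma_direct(coflows[name], residual)
            if g == float("inf"):
                failed.append(name)     # C_Failed: gets MCF leftovers later
                continue
            for e, vol in coflows[name].items():
                residual[e] -= vol / g  # reserve the rate that finishes at g
            schedule[name] = g
        return schedule, failed

    wan = {("A", "B"): 10.0, ("C", "B"): 4.0}
    jobs = {"coflow-1": {("A", "B"): 40.0},                    # 5 GB
            "coflow-2": {("A", "B"): 40.0, ("C", "B"): 80.0}}  # 5 GB + 10 GB
    print(minimize_cct_offline(jobs, wan))
    # ({'coflow-1': 4.0}, ['coflow-2']): coflow-2 joins the failed set because
    # A->B is fully reserved; work conservation would then serve it via MCF.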
3.1.3 From Offline to Online

So far, we have assumed that all coflows arrive together and are known a priori. However, in practice, coflows arrive over time as DAG dependencies are met. Additionally, WAN links can fail and their bandwidth can fluctuate. Scheduling coflows in FIFO order [32, 34] is a simple solution, but it can result in head-of-line blocking [21, 33, 46]. Instead, preemption can minimize the average completion time [21, 33, 46].

Starvation-Free Preemption  We allow coflows with smaller remaining completion times to preempt larger ones to avoid head-of-line blocking (Pseudocode 2). To avoid the starvation issue that may arise with preemptive scheduling, we guarantee each coflow some share of the network; specifically, an α fraction of the WAN capacity is shared among preempted coflows (line 2 in Pseudocode 1). By default, α = 0.1.

Pseudocode 2. Online Scheduling-Routing

     1: procedure onArrival(Coflows C, Coflow C_i)
     2:     if D_i ≠ ∞ then                         ▷ C_i has a deadline
     3:         G′ ← Scale down G by (1 - α)        ▷ Guarantee admitted coflows
     4:         G′ ← G′ - {f_j^k} for all admitted C_j
     5:         Γ_i ← Solve Optimization (1) for C_i on G′
     6:         if Γ_i > η·D_i then
     7:             Reject C_i                      ▷ Reject C_i if its deadline cannot be met
     8:     C ← C ∪ {C_i}
     9:     C′ ← Sort C by decreasing D_i and then by increasing Γ_i
    10:     allocBandwidth(C′, G)
    11: end procedure

Scalable Online Scheduling  In the online scenario, many events that trigger re-optimization may arrive at arbitrary times:
1. A coflow is submitted as its dependencies are met;
2. A FlowGroup finishes;
3. A coflow finishes because all of its FlowGroups finished;
4. The WAN topology changes because of bandwidth fluctuations or failures.
Running the offline algorithm upon each event would incur high complexity. Terra avoids this high complexity by categorizing the events and re-optimizing only those FlowGroups that need updates. For WAN bandwidth fluctuations, we consider ρ = 25% to be the threshold for a significant bandwidth change that can cause a rescheduling, filtering out short-term fluctuations.

3.2 Extensions

Supporting Deadlines  To provide guaranteed completion of a coflow C_i within its deadline (D_i), Terra uses admission control. We admit a coflow if it can meet its deadline without violating any other already-admitted coflow's deadline, i.e., if its minimum completion time (Γ_i) under the current WAN conditions does not overshoot its deadline (line 7 in Pseudocode 2). Note that we use a relaxation factor η (η ≤ 1) to mitigate the variability of the WAN. However, when the bandwidth fluctuation is more than (1 - η), no deadlines can be guaranteed. An admitted coflow is never preempted.

Completing a coflow faster than its deadline has no benefit [33]. Hence, a known optimization is elongating its CCT until the deadline and sharing the remaining bandwidth with others. This can be done by scaling the f^k(u, v) values by Γ_i/D_i.

Supporting DAGs and Pipelined Workloads  Many data analytics jobs form DAGs of computation stages with communication stages, or coflows, in between [4, 5, 26, 51, 78]. Job masters can submit requests for each coflow in a DAG independently to Terra as dependencies are met. Job masters can also submit a coflow with only some of its flows as soon as their dependencies are met, and then update the coflow to add more flows as more dependencies are satisfied. This is useful when the preceding computation tasks do not finish at the same time. In this case, Terra tries to finish all the submitted flows of the coflow together, eventually finishing all the flows together. Our evaluation shows that this simple strategy performs well (§6). Although it may be possible to perform DAG-aware optimizations [43, 44], we consider that a job-master-specific decision and out of Terra's purview.

4 Terra: System Design

So far we have focused on designing a scalable algorithm for minimizing the average coflow completion time (§3). In this section, we discuss how to implement the solution in a scalable manner too. Furthermore, we consider how to make it robust to WAN variabilities. We start with an architectural overview of the whole system and then provide insights into designing individual components.

4.1 Architectural Overview

As shown in Figure 5, Terra has two primary components. A logically centralized Terra controller orchestrates all data transfers. In each datacenter, a set of Terra agents coordinates with the controller and transfers data on behalf of the jobs. Interactions between them can be summarized as follows:
1. Job master(s) submit coflows to the Terra controller using …

[Figure 5. Terra architecture: Terra agents in each datacenter coordinate with a central controller over WAN links; the surviving figure labels show example link bandwidths and latencies (e.g., 1 Gbps/80 ms, 1.5 Gbps/100 ms, 1 Gbps/20 ms).]

4.3 […]

Minimizing Scheduling Overhead  Because Terra must consider routing, it cannot use existing topology-agnostic heuristics [31, 33]. However, the total number of flows in a GDA coflow adds significant time complexity to an integer linear program-based solution. Consequently, Terra leverages the FlowGroup abstraction [58] that allows us to remove the integral constraints, leading to a practical solution (§3). Each scheduling round takes O(100) milliseconds for the topologies described in [47] and [53], and O(1) seconds for larger topologies with O(10) datacenters (§6.6). Given that many GDA jobs take several minutes to complete [71], Terra is not time-constrained in decision making and dissemination. Finally, because most traffic comes from …
