ROSE: Cluster Resource Scheduling Via Speculative Over-subscription


Xiaoyang Sun (1), Chunming Hu (1), Renyu Yang (2,1), Peter Garraghan (3), Tianyu Wo (1), Jie Xu (2,1), Jianyong Zhu (1), Chao Li (4)
(1) Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, China
(2) School of Computing, University of Leeds, UK
(3) School of Computing and Communications, Lancaster University, UK
(4) Alibaba Group
j.xu}@leeds.ac.uk; p.garraghan@lancaster.ac.uk; chao.li@alibaba-inc.com

Abstract—A long-standing challenge in cluster scheduling is to achieve a high degree of utilization of heterogeneous resources in a cluster. In practice there exists a substantial disparity between perceived and actual resource utilization. A scheduler might regard a cluster as fully utilized if a large resource request queue is present, but the actual resource utilization of the cluster can in fact be very low. This disparity results in the formation of idle resources, leading to inefficient resource usage and incurring high operational costs and an inability to provision services. In this paper we present a new cluster scheduling system, ROSE, that is based on a multi-layered scheduling architecture with an ability to over-subscribe idle resources to accommodate unfulfilled resource requests. ROSE books idle resources in a speculative manner: instead of waiting for resource allocation to be confirmed by the centralized scheduler, it requests intelligently to launch tasks within machines according to their suitability to over-subscribe resources. A threshold control with timely task rescheduling ensures fully-utilized cluster resources without generating potential task stragglers.
Experimental results show that ROSE can almost double the average CPU utilization, from 36.37% to 65.10%, compared with a centralized scheduling scheme, and reduce the workload makespan by 30.11%, with an 8.23% disk utilization improvement over other scheduling strategies.

Index Terms—cluster scheduling, resource management, over-subscription

Dr. Renyu Yang is the corresponding author.

I. INTRODUCTION

Improving the resource utilization of clusters is a long-standing issue that has become increasingly important to satisfy global demand for Internet services such as web search, social networking, and machine learning applications. Various applications often exhibit diverse workload characteristics in terms of task scale and resource heterogeneity. To cope with such diverse characteristics, modern cluster management systems, e.g. those in [9][10][11][12], are designed to effectively allocate jobs onto machines, and manage various resource requirements as a unified pool of underlying resources. However, there exists a substantial disparity between resource usage requested by jobs and the actual cluster resource utilization. For instance, studies of production clusters from Twitter and Google show that typical disparities are around 53% and 40% for CPU and memory respectively [13]. Therefore, the actual CPU utilization is between 20% and 35% and the memory utilization is from 20% to 40% [14]. A scheduler might consider a cluster as being fully utilized if a large resource request queue is present, even when the actual resource utilization of cluster machines is in fact very low. This disparity results in the formation of idle resources, producing inefficient cluster resource usage and incurring higher operational costs and an inability to provision services. Additionally, computation-intensive batch DAG jobs are increasingly common within data-parallel clusters and should be handled properly.
These batch jobs are typically segmented into a large number of tasks which last only sub-second or seconds duration [11][15].

The resource utilization of a cluster may be improved through over-subscription (also known as overbooking) [16], which enables jobs to be executed within a cluster by leveraging resources from existing jobs that are presently underused or idle. This technique has been heavily exploited at different levels of a cluster, including the kernel [8], the hypervisor [17], and the cluster resource scheduler. For example, the resource scheduler would launch speculative tasks for existing jobs, where the tasks use the over-subscribed resources and run in a best-effort manner.

A centralized resource scheduler, such as YARN and Mesos, performs over-subscription decision making [4] [5] through a central manager that transmits the information about updated idle or revocable resources to a job via piggybacking on regular heartbeat messages within the cluster. However, several heartbeat messages are required for the transmission of load information, resource scheduling, and dispatching of speculative tasks. The duration of this transmission process becomes even longer when task retry and re-submission are needed. As heartbeats are typically configured to be sent at 3s intervals [9], it is likely that these instructions will no longer reflect nor match the current cluster resource usage. The reason for this mismatch is primarily the creation of potentially thousands of new tasks, the diverse resource consumption patterns of existing tasks, and tasks that have second or even sub-second completion times. This consequently results in a sub-optimal overall job makespan that could be significantly improved. A decentralized scheduler may resolve some of the issues by assigning speculative tasks randomly [18] or on a per-application

basis [19]. However, this approach is highly dependent on accurate queue delay times, which requires customers to provide precise job completion times. This is unfortunately infeasible in practice considering the lack of customer knowledge and unknown job types. What is needed is an over-subscription approach that is capable of overcoming the late delivery of idle resources for speculative tasks, as well as exploiting cluster diversity in terms of heterogeneous resources and dynamic resource usage.

In this paper we propose ROSE, an efficient and speculative Resource Over-subscribing SchedulEr for cluster resource management, which improves utilization whilst reducing job end-to-end time. A ROSE job leverages idle resources directly from the node controller daemon situated within every machine and creates speculative tasks, instead of waiting for resources to be made available from the centralized resource manager. These speculative tasks are then launched within machines that are determined to be most suitable for resource over-subscription through a multi-phase filtering process, considering estimated load, correlative workload performance, and queue states. Such information is incrementally and timely synchronized to the application-level scheduler. A threshold control is used to determine and control the launching of speculative tasks, thereby globally maximizing the reusability of cluster resources. Furthermore, task rescheduling is also employed to reduce head-of-line blocking and resultant straggler manifestation [20] within speculative tasks. ROSE can be integrated into any multi-layer cluster system, and is capable of over-subscribing multi-dimensional resources (e.g., CPU, memory, network, etc.).
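To make the heartbeat-staleness argument concrete, the following back-of-the-envelope sketch contrasts heartbeat-piggybacked dispatch with a direct AM-to-NM speculative request. The round-trip counts and RPC time are illustrative assumptions, not measurements from the paper; only the 3s heartbeat interval comes from the text.

```python
# Illustrative comparison (assumed numbers): dispatch latency when
# over-subscription decisions piggyback on periodic heartbeats versus
# a direct AM -> NM speculative request as in ROSE.

HEARTBEAT_INTERVAL_S = 3.0  # typical NM heartbeat period noted above [9]

def piggyback_latency(round_trips: int, interval: float = HEARTBEAT_INTERVAL_S) -> float:
    """Each scheduling step (load report, allocation, dispatch, retry, ...)
    waits for the next heartbeat, so latency grows with round trips."""
    return round_trips * interval

def direct_request_latency(rpc_ms: float = 50.0) -> float:
    """A speculative request goes straight to the NM in one RPC (assumed 50 ms)."""
    return rpc_ms / 1000.0

# Three heartbeat-bound steps already cost ~9s; a single task re-submission
# after rejection adds further whole intervals.
print(piggyback_latency(3))      # 9.0
print(piggyback_latency(3 + 2))  # 15.0 with one retry
print(direct_request_latency())  # 0.05
```

For sub-second tasks, even one heartbeat interval exceeds the task's own runtime, which is why piggybacked decisions can be stale on arrival.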
We implemented and evaluated ROSE within the open-source resource management system YARN [1] and Alibaba's cluster management system Fuxi [11] in order to study improvements against comparative over-subscribing strategies. The main contributions of this work are:

- A general resource over-subscribing architecture for cluster resource management that enables idle resources to be leveraged to create speculative tasks.
- A multi-phase machine filtering process that considers diverse factors (e.g., task-level rating and machine states) to locate optimal machines for speculative execution.
- A runtime threshold control with timely task rescheduling that provides speculative task management.

II. RESOURCE UTILIZATION ISSUES

Centralized scheduling framework. Modern resource scheduling systems typically decouple the resource management layer from the job-level logical execution plans to enhance system scalability, availability and framework flexibility.

[Fig. 1. Daily memory utilization of a 10,000 node cluster]
[Fig. 2. Breakdown of total cluster memory usage: (a) cluster memory usage; (b) cluster memory CDF]
[Fig. 3. Idle and fragmented memory across servers, ordered by server ID]
[Table I. Centralized over-subscription results]

For instance, YARN [1] and Fuxi [11] share the following components: Resource Manager (RM) is the centralized resource controller; it tracks resource usage and node liveness, enforces allocation invariants, and arbitrates contention among tenants. The component is also responsible for negotiation between the available resources within the infrastructure and the resource requests from Application Masters.
Application Master (AM) is an application-level scheduler which coordinates the logical plan of a single job by requesting resources from the RM, generating a plan from received resources, and coordinating its execution. Node Manager (NM) is a daemon process within each machine responsible for managing local tasks (including launch, suspend, kill, etc.).

Production-level cluster resource usage. Logical resource utilization is a metric often studied to measure scheduler performance. Higher utilization implies more efficient scheduler decision making and faster job completion. We profiled these characteristics by exploring the resource behavior of a production-level cluster to ascertain idle resource occurrence. Due to the cluster's cyclical behavior, we randomly select a consecutive 5-day period and analyze over 10,000 machines at Alibaba to study daily usage patterns. The metrics captured include TotalResource, the total available resource (referring to memory for demonstration); RMPlanned, the total amount of memory assigned to all AMs (job managers) after resource assignment; AMPlanned, the amount that is obtained and used by all AMs; and NMReal, the total resource consumed by all containers at runtime.

As shown in Fig. 1, the mean memory utilization of the cluster is 30%, with a deviation of 14.2%. Furthermore, a daily temporal usage pattern can be visually observed in the cluster due to large-scale batch processing conducted between 0:00 and 08:00 (CST). There exist sufficient idle cluster resources that can be reused. Fig. 2(a) shows the memory consumption of the entire cluster within a 12-hour period. Approximately 97.1% of memory on average can be reached by RMPlanned,

indicating that most memory is visible and initially utilized by the scheduler. Moreover, AMPlanned is extremely close to RMPlanned, indicating that the majority of job requests within the cluster are eventually confirmed with corresponding resources granted. Nevertheless, a large disparity between RMPlanned and NMReal can be perceived, resulting in actual resource utilization being very low. Fig. 2(b) reveals that over 80% of sampled data points experience no more than 55% actual memory usage compared with total assigned memory.

Overestimation & fragmentation. We further discover that this under-utilization results from user overestimation and resource fragmentation. For example, at time point (27,500s), we collected the total amount reserved by all running jobs and calculated the fragmentation on each machine across the cluster. Fig. 3 depicts statistics ordered by the resource amount, showing that more than 80GB of memory could be reused within most machines (with 132GB memory per server). In fact, users typically request excessive amounts of resources to handle workload bursts and avoid SLA (Service Level Agreement) violations. Fragmentation is also a common occurrence in clusters, especially during periods of high resource requests. From our analysis, there exists a significant gap between actual and predicted resource allocation that remains unused.

[Fig. 4. Workflow and inter-component interaction]

Resource mismatch. To reuse such idle resources, centralized over-subscription scheduling (e.g., Hadoop 3.0 [4]) has been proposed. To demonstrate its consequential limitations, we submit 2,000 speculative tasks to a Hadoop 3.0 cluster and observe the number launched, varying the customizable max queue length (mQL) on NMs.
However, the decision and heartbeat piggybacking mechanism results in delayed message delivery and a mismatch with the latest resource usage during heartbeat intervals, leading to a great number of task redistributions. For example, on average 19.5% of submitted tasks are excluded from the queue and 65.3% have to be re-scheduled even when they are allowed to enqueue. Merely 11.7% on average can be successfully launched. Additionally, due to the inherent workflow in centralized resource scheduling, unsuccessful tasks require several heartbeat intervals before they can be re-submitted, re-scheduled and dispatched onto a new NM. Such inefficiency greatly degrades the performance of resource over-subscription and job execution.

Challenges & requirements. A challenge for scheduling computation-intensive batch jobs is reducing the probability of over-subscription violation and the consequent compensation (such as rescheduling or eviction of speculative tasks) in order to improve job end-to-end performance and system utilization. Another challenge is how to timely detect and exploit cluster diversity in terms of heterogeneous resources and dynamic resource usage, and how to realize appropriate speculative task execution using idle resources optimally. To address these challenges, the first objective is [R1] to design a generalized resource over-subscribing mechanism that can effectively create speculative tasks compatible with the established centralized resource management pipeline; the mechanism should effectively find and timely deliver the idle resource information for launching such tasks.
Additionally, the system should [R2] fully exploit the dynamicity of the cluster, as well as workload/server heterogeneity [13][21], to reduce task eviction occurrence, and harness idle cluster resources including fragmented resources and allocated (yet idle) resources. To underpin efficient resource discovery and optimal determination of task dispatching, resource scheduling should also [R3] aggregate and consider both application-level and multi-dimensional system information, and flexibly manage the life-cycle of speculative tasks on specific machines.

III. RESOURCE-OVERSUBSCRIPTION BASED SCHEDULING

A. Oversubscription-based Scheduling

To improve scalability and deal with the heartbeat-dependent issues in centralized scheduling, decisions for over-subscription and speculative task launching are made in each job application master independently. We enable speculative tasks to be launched and executed by leveraging idle resources through the following steps:

(Step 1) A ROSE job requests resources from the RM. (Step 2) Once the resources in the cluster have been allocated, no further regular resources are assigned to jobs. (Step 3) A job attempts to request additional resources directly from NMs in a speculative manner, rather than waiting for the emergence of available resources released by the RM. The job then requests to launch speculative tasks in machines that are determined by the Cluster Aggregator (CA) to be most suitable to over-subscribe resources. (Step 4) To avoid inter-task performance interference, speculative tasks run at lower priorities and are preemptable compared to currently executing tasks in the machine.
The NM maintains a local waiting queue to preserve the order of submitted speculative tasks, while the NM judges whether the speculative tasks can be accepted through a controller according to runtime system information. For example, as long as the maximal utilized resource does not surpass the upper-bound threshold, additional workload can be launched onto the physical machine. (Step 5) Once accepted, the speculative tasks are enqueued and wait to be scheduled by the threshold controller; otherwise, these attempts will continue periodically.

It is worth noting that the procedure of over-subscribing resources to speculative tasks is decoupled from the centralized resource manager and dispersed into separate job masters. In reality, the RM does not perceive speculative tasks and their utilized resources. The idle resources are actually taken and re-used by the speculative tasks without being detected by regular tasks. Fragmented or idle resources can be fully utilized by speculative tasks, and the task instances can be immediately scheduled to execution machines. Consequently, these instances can run in advance, greatly accelerating the running process. Although there is a risk that such tasks may be killed, holistically it is still more effective at shortening job duration as well as utilizing idle cluster resources.

[Fig. 5. Architecture and design]
[Fig. 6. For each speculative task: the load on the currently scheduled machine and the minimum load (defined as optimal) in the cluster at that moment]

B. ROSE Architecture Overview

To achieve [R1], we propose an architecture that not only guarantees that sufficient resources can be obtained by specific jobs through centralized resource management, but also reuses uncollected resources by executing speculative tasks. ROSE is designed to be complementary and compatible with existing protocols between the AM and RM, and thus can enhance existing two-layer scheduling systems.
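As a rough illustration, the Step 1-5 submission flow can be sketched as follows. All class and function names here (NodeManagerStub, try_enqueue, submit_speculative) and the threshold and queue-capacity values are hypothetical, not ROSE's actual API.

```python
# Sketch of the speculative submission workflow from the AM's perspective.
# Candidate NMs are assumed to be pre-ranked by the Cluster Aggregator.

class NodeManagerStub:
    """Stands in for an NM with a threshold-controlled waiting queue."""
    def __init__(self, name, util, threshold=0.8, queue_cap=4):
        self.name, self.util = name, util
        self.threshold, self.queue_cap = threshold, queue_cap
        self.queue = []

    def try_enqueue(self, task):
        # Steps 4-5: the controller admits a speculative task only while
        # utilization stays below the upper-bound threshold and the
        # waiting queue has room; otherwise the AM retries later.
        if self.util < self.threshold and len(self.queue) < self.queue_cap:
            self.queue.append(task)
            return True
        return False

def submit_speculative(tasks, candidates):
    """Steps 2-3: once regular resources are exhausted, ask the most
    suitable candidate NMs directly, without waiting on the RM."""
    launched, pending = [], []
    for task in tasks:
        for nm in candidates:            # candidates pre-sorted by the CA
            if nm.try_enqueue(task):
                launched.append((task, nm.name))
                break
        else:
            pending.append(task)         # retried periodically (Step 5)
    return launched, pending

nms = [NodeManagerStub("nm1", 0.35), NodeManagerStub("nm2", 0.92)]
launched, pending = submit_speculative(["t1", "t2"], nms)
print(launched)   # [('t1', 'nm1'), ('t2', 'nm1')] -- nm2 is over threshold
```

The key design point mirrored here is that rejection is cheap: a refused task simply stays pending for a later round, with no RM round trip involved.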
Generally speaking, to prevent excessive resource over-subscription and the resultant performance degradation, we design a multi-level threshold controller at runtime within each NM to determine the timing of launching speculative tasks and to limit the over-subscription amount and the number of running speculative tasks. To maximize the effectiveness of resource over-subscription, ROSE leverages a machine selection mechanism to select candidate destinations for tasks.

Fig. 5 illustrates an overview of the ROSE system architecture, which contains two scheduling modes in the AM. The first mode is to request resources from the centralized resource manager and then launch relevant tasks or make a resource re-allocation, which can guarantee that dispatched tasks can be executed immediately without resource conflicts. The second mode is the over-subscription mode, which proactively negotiates with an NM over whether it can accept speculative tasks. To support this, the AM locally maintains a replica of the candidate machine collection that is periodically coordinated and incrementally synchronized by the maintainer in the Cluster Aggregator. Even in case of long task starvation due to queuing or eviction, the AM in ROSE can instantly re-submit the speculative tasks according to the maintained machine collection. The AM adopts a selection process to realize task placement in the event of a large number of tasks concentrating on a minority of machines. The ROSE client can decide whether to turn on the distributed scheduling mode and customize the ratio of tasks that are allowed for over-subscription per application.

To achieve [R2] and [R3], the Cluster Aggregator (CA) is designed as a module decoupled from the RM, to monitor runtime information and aggregate task-level status.
The CA collects machine load from each NM and aggregates each machine score based on the estimation from each AM (Section IV-A). Based on this information, the CA can determine a set of candidate machines in a certain time period, and the AM then launches speculative tasks onto machines according to the results. In the CA, we use a multi-phase machine filtering mechanism to select and rank machine candidates prior to over-subscribing resources, taking into consideration machine load, correlative workload performance, and queue states for decision-making (Section IV-B).

The ROSE Job Master is a specific AM that leverages the proposed over-subscription mechanism to compensate for unfulfilled resource requests. First, the Task Scorer component is used to rate machines based on the internal state and the state transitions of all tasks within the job. The task-level scoring thereby becomes a valuable criterion during the machine filtering process within the CA. Additionally, as the main actor and beneficiary of the resource over-subscription, the AM coordinates between the regular resource scheduling and the over-subscription strategy, and adaptively upgrades the resource priority/level via the Over-subscription Controller when resources are granted or preempted. Due to variations in cluster states, it is possible that scheduling decisions might become sub-optimal. For example, some tasks may have already been dispatched to a specific machine but experience head-of-line blocking. To this end, a timeout-based re-scheduling might be triggered by the Timeout Detector to mitigate the task starvation.

Apart from regular task management, queue management to order and control speculative tasks must be imposed in the NM. Tasks from different AMs are enqueued and wait to be launched. This is triggered by the node controller, which determines whether the node capacity allows for further over-subscription and which speculative task can be launched, based on real-time queue status and machine loads.
This is performed by: a) a monitoring component that captures real-time metrics and reports them to the CA (Section IV-A), and b) a threshold controller responsible for process management and runtime resource isolation (Section V-B).
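The NM-side admission decision described above might be sketched as follows. The dimension names, the per-dimension bounds, and the cap on concurrently running speculative tasks are assumed values for illustration, not ROSE's actual configuration.

```python
# Sketch of a multi-level threshold check guarding speculative launches.
# Thresholds per resource dimension are assumptions.
THRESHOLDS = {"cpu": 0.85, "memory": 0.80, "disk_io": 0.75}

def may_launch_speculative(usage: dict, running_spec: int,
                           max_spec: int = 8) -> bool:
    """Launch the next queued speculative task only if every resource
    dimension is under its upper bound and the running-count cap holds."""
    if running_spec >= max_spec:
        return False
    return all(usage.get(dim, 0.0) < bound
               for dim, bound in THRESHOLDS.items())

print(may_launch_speculative({"cpu": 0.40, "memory": 0.55, "disk_io": 0.30}, 2))  # True
print(may_launch_speculative({"cpu": 0.90, "memory": 0.55, "disk_io": 0.30}, 2))  # False
```

Checking every dimension independently captures the multi-dimensional nature of over-subscription noted earlier: a machine with idle CPU but saturated disk I/O should still refuse further speculative load.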

IV. LOAD AGGREGATION AND MULTI-PHASE FILTERING

We first demonstrate that dispatched speculative tasks can improve cluster resource utilization efficiency. We conduct experiments to investigate whether speculative tasks are assigned to machines in an optimal manner. Each task is expected to be assigned onto the machines with the least load for the sake of load balancing. To quantify this desired outcome, we profile all speculative tasks by contrasting the utilization of the current machine where the task is running against that of the lowest utilized machine across the cluster in the same measurement interval. Fig. 6 shows the optimal and actual values of different load dimensions (CPU, memory, load and disk utilization); the ideal CPU value (orange segment) indicates that no more than 50% of tasks are optimally allocated to machines in terms of utilization. In terms of memory, this gap is even larger, indicating that the task allocation strategy can be improved substantially. The CDF shown in Fig. 6 also demonstrates that more than 62% of tasks could be placed onto better machines with at least half the load.
Therefore, selecting suitable machines is highly preferable to enhance cluster utilization efficiency.

Algorithm 1 Load Level Approximation (LLA)
Input: D – continuous trace data of a machine metric;
       k – the pre-defined granularity of accuracy (default value is 5)
Output: v – a predicted tendency value
 1: if D is monotonous then
 2:   return the last element of D
 3: if card(D) ≤ k then
 4:   return eliminateOutlierMean(D)
 5: let S_i ← ∅, i ∈ [1, k]
 6: for each d ∈ D do
 7:   S_i ← S_i ∪ {d}
 8:   if card(S_i) ≥ ⌈card(D)/k⌉ then
 9:     i ← i + 1
10: D′ ← {eliminateOutlierMean(S_i) | i ∈ [1, k]}
11: if D′ is monotonous then
12:   return the last element of D′
13: else
14:   return eliminateOutlierMean(D′)

Function: eliminateOutlierMean(I)
Input: I – several continuous sample data
Output: m – mean with outliers eliminated
 1: Q1 and Q3 are the lower and the upper quartiles respectively
 2: T ← {d | d ∈ [Q1 − k(Q3 − Q1), Q3 + k(Q3 − Q1)]} (customarily with k = 1.5)
 3: return m ← (Σ_{d∈T} d) / card(T)

A. Runtime Load Monitor and Aggregation

Load Monitor. To support precise machine selection at runtime while measuring resource utilization and cluster health, a load monitor captures several metrics for each machine. For instance, we monitor resource utilization (cpu_util, memory_util, disk_util, and network received/transmit) and the waiting/running container numbers for both regular and speculative containers. The monitor is currently implemented in the NM as a submodule and can be deployed as an underlying service for operational management. Considering the volume of the monitored data and the resultant network pressure, we aggregate the load information during a fixed time interval of X seconds and periodically collect and transmit it every Y seconds, wherein X and Y embody the tradeoff between data precision and monitoring overhead. Users can manually configure these parameters according to machine bandwidth or other requirements.
We also have to depict the approximate load level based on the accumulated historical data, and this value will be utilized as a metric in the decision making.

Scalability Considerations. The rate of data generation will result in scalability issues for data processing and subsequent task allocation. For example, if we set X to 2s, there will be 30 records per machine within a minute, and over 300,000 pieces of data in a 10,000-node cluster that must be collected and processed to perform scheduling. It is extremely challenging to timely provide an estimated approximation of load for each machine during a specific time frame, particularly when fluctuations of demand and workload frequently change over time. To solve this problem, we propose a piecewise-based algorithm and a de-noising load acquisition method to accelerate the runtime load evaluation without substantial precision degradation. The NM locally conducts the calculation before sending the intermediate results to the CA. The stress of transmission and calculation can be significantly mitigated by the approximated estimation and decentralized design.

Load Estimation. Algorithm 1 depicts the load level estimation based on the prior monitoring information and a de-noising process. We define the basic calculation unit as conducted over a sliding window with a fixed number of data points. To accelerate the estimation, we divide the entire time period into k segments along the timeline (Lines 1-5) and the calculation can be parallelized over different segments (Line 6). For each segment, we calculate the average load by de-noising the sample data through a calculation (eliminateOutlierMean) based on Tukey's boxplot [22] [23] to pinpoint and eliminate potential outliers, as described in Alg. 1. Moreover, ROSE is compatible with other statistical methods of outlier detection [24] [25] to handle data with asymmetric distributions. It is noteworthy that all segments can conduct the average evaluation in parallel.
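A runnable Python rendering of Algorithm 1 is given below. It follows the pseudo-code directly; the only interpretive choices are the monotonicity test (non-decreasing or non-increasing) and segment boundaries via a fixed chunk size.

```python
# Sketch of Algorithm 1 (LLA) with Tukey-boxplot de-noising.
from math import ceil
from statistics import mean, quantiles

def eliminate_outlier_mean(samples, k=1.5):
    """Mean after dropping points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(samples) < 4:            # too few points to estimate quartiles
        return mean(samples)
    q1, _, q3 = quantiles(samples, n=4)
    iqr = q3 - q1
    kept = [d for d in samples if q1 - k * iqr <= d <= q3 + k * iqr]
    return mean(kept) if kept else mean(samples)

def is_monotonous(seq):
    return all(a <= b for a, b in zip(seq, seq[1:])) or \
           all(a >= b for a, b in zip(seq, seq[1:]))

def lla(trace, k=5):
    """Load Level Approximation over one sliding window of samples."""
    if is_monotonous(trace):
        return trace[-1]                     # follow the clear tendency
    if len(trace) <= k:
        return eliminate_outlier_mean(trace)
    seg = ceil(len(trace) / k)               # split window into k segments
    means = [eliminate_outlier_mean(trace[i:i + seg])
             for i in range(0, len(trace), seg)]
    if is_monotonous(means):
        return means[-1]
    return eliminate_outlier_mean(means)

print(lla([10, 20, 30, 40]))   # monotonous trace -> 40
```

In deployment each segment mean could be computed in parallel, as the text notes; the list comprehension above is the sequential equivalent.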
After the local calculation, we obtain the value set D′ containing the k average values. If its elements are monotonous, we can easily determine the target value according to the tendency (Line 12). Otherwise, we regard the outlier-eliminated mean of D′ as the estimated result (Line 14). Due to parallelism and approximation, the load can be efficiently calculated. Compared with decision making based on loads at a single measured instant, load estimation based on a recent sliding window can more precisely capture load variation and obtain suitable machines.

B. Multi-Phase Machine Filtering Mechanism

Task Behavior Aware Machine Rating (TBAMR). To accurately reflect the state of task execution, each machine is assigned a satisfaction level reflecting its ability to successfully execute speculative tasks, calculated through the use

of historical job execution data. Specifically, any action that negatively impacts task execution, such as long-tail, failure, task kill and eviction, is regarded as negative behavior that reduces the machine's satisfaction score. This machine rating is calculated for each machine by all AMs. Eventually, each AM reports the perceived machine scores to the CA; the detailed pseudo-code is shown in Alg. 2. The procedure is an important step that is integrated into the machine selection (Alg. 3, Line 2).

Algorithm 2 Task Behavior Aware Machine Rating (TBAMR)
Input: AM – terminated jobs in a specific time period, AM_i represents the i-th job's application master;
       Task_ij – a set of all tasks, j ∈ {1, ..., N_i} represents the j-th task of AM_i
Output: Score – a synthetic score for all machines
 1: for each AM_i ∈ AM do
 2:   let penaltyScore ← ∅
 3:   for each Task_ij ∈ Task_i do
 4:     if Task_ij's status ∈ {failed, crashed, tailed, killed} then
 5:       m ← getHostID(Task_ij)
 6:       penaltyScore_m ← penaltyScore_m + 1
 7:   penaltyScore′ ← topK(penaltyScore)
 8:   for each ps_i ∈ penaltyScore′ do
 9:     Score_i ← Score_i − ps_i
10: return Score

Algorithm 3 Multi-phase Machine Filtering (MMF)
Input: (M, AM, Task, mL)
Output: C – candidate machines fit for over-subscribed resources
 1: LM ← {LLA(M_i).normalize() | M_i ∈ M}
 2: MachScore ← TBAMR(AM, Task).ascendSortByScores
 3: B1 ← {M_i ∈ M | S_i ∈ topK(MachScore)}
 4: B2 ← {M_i | M_i ∈ M and M_i is disconnected}
 5: B3 ← {M_i | (∃j)(LM_ij ≥ the j-th threshold)}
 6: M′ ← M − B1 − B2 − B3
 7: let candidateInfo ← ∅
 8: for each M_i ∈ M′ do
 9:   lIndex ← LM_il · loadFilter
10:   qIndex ← LM_iq · queueFilter
11:   candidateInfo ← candidateInfo ∪ {(M_i, lIndex, qIndex)}
12: C′ ← candidateInfo.ascendSortBy(lIndex).topK(d · mL)
13: C″ ← C′.ascendSortBy(qIndex).topK(mL)
14: C ← {m | (∃l)(∃q)((m, l, q) ∈ C″)}
15: return C

Multi-phase Machine Filtering. To select the most suitable machine set where
