Evaluating Scalability and Efficiency of the Resource and Job Management System on Large HPC Clusters

Transcription

Evaluating scalability and efficiency of the Resource and Job Management System on large HPC Clusters
Yiannis Georgiou and Matthieu Hautreux

Plan
- Introduction - Motivations
- Experimental Methodology
- Evaluation Results - Discussions
  - Scalability in terms of number of submitted jobs
  - Scheduling efficiency in terms of network topology aware placement
  - Scalability in terms of number of resources
- Conclusions - Future Works




Introduction
Facts for large HPC supercomputers
- Increased number of resources
- Submission of large numbers of jobs
- Large network diameters and more complex designs

Resource and Job Management System (RJMS)
The goal of the RJMS is to satisfy users' demands for computation and assign user jobs upon the computational resources in an efficient manner.

RJMS issues for large HPC supercomputers
- Resources scalability
- Large workload management
- Network topology aware placement efficiency

HPC supercomputers
Tera-100 supercomputer
- In production since November 2010
- 17th in the June 2012 Top500 list
- 1.25 Pflops theoretical peak
Curie supercomputer
- In production since March 2012
- 9th in the June 2012 Top500 list
- 1.6 Pflops theoretical peak



Motivations
Evaluate the behaviour of SLURM regarding:
- Large workload management
- Network topology aware placement
- Scalability in terms of number of nodes to manage
Tune the Tera-100 and Curie configurations for best performance
- Core-level allocation, fine topology description, backfill scheduler
Evaluate the behaviour of SLURM for larger machines
- Experiments beyond the current scale of existing machines




Experimental Methodology
Method for throughput evaluations
- Tera-100 nodes during maintenance periods
- Real-scale experiments
- Submission burst generation scripts

Method for efficiency and scalability evaluations
- Emulated experiments through the multiple-slurmd mode
  - Tera-100 nodes during maintenance periods
  - Multiple virtual SLURM nodes per physical node
  - Enhanced to ease scalable deployment of simulated clusters
  - Patches accepted in the SLURM main branch
- Fully automated to enable reproduction of the same experiments
  - Automatic generation of the simulated cluster configuration (a configuration-generation sketch follows below)
  - Automatic start of all the SLURM daemons
- Automatic run of a single benchmark
  - Based on a known model: ESP
  - Variations of ESP used with different job counts and sizes
- The goal is to evaluate the behavior of SLURM scheduling and topology placement efficiency; these functions take place as internal mechanisms on the slurmctld controller daemon.
- Similar results of workload executions with real-scale and emulated experiments.
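As an illustration of how such a simulated-cluster configuration can be generated, the sketch below emits slurm.conf node definitions that map many virtual nodes onto a few physical hosts; the multiple-slurmd mode needs each virtual slurmd to have its own node name and port. Host names, node counts and the port base are hypothetical, and this is a minimal sketch rather than the configuration actually used on Tera-100.

```python
#!/usr/bin/env python3
"""Generate slurm.conf NodeName lines for an emulated (multiple-slurmd) cluster.

Assumptions (illustrative only): virtual nodes are named vn00001..vnNNNNN,
physical hosts are named host001..hostMMM, and each virtual slurmd on a given
host listens on its own port starting at BASE_PORT.
"""

VIRTUAL_NODES = 5040      # size of the emulated cluster
PHYSICAL_HOSTS = 200      # physical Tera-100 nodes used as back-end
BASE_PORT = 17000         # first slurmd port on each physical host
CPUS_PER_VNODE = 16       # cores advertised by each virtual node

def generate(out):
    per_host = -(-VIRTUAL_NODES // PHYSICAL_HOSTS)   # ceiling division
    for i in range(VIRTUAL_NODES):
        host = i // per_host          # which physical host backs this virtual node
        slot = i % per_host           # index of the slurmd on that host -> port offset
        out.write(
            "NodeName=vn%05d NodeHostname=host%03d Port=%d CPUs=%d State=UNKNOWN\n"
            % (i + 1, host + 1, BASE_PORT + slot, CPUS_PER_VNODE)
        )

if __name__ == "__main__":
    import sys
    generate(sys.stdout)
```

In a multiple-slurmd build, each virtual daemon would then be started on its physical host with its virtual node name (`slurmd -N vnXXXXX`), which is the spirit of the "automatic start of all the SLURM daemons" step.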

RJMS evaluation with workload injection
ESP benchmark
- Quantitative evaluation of launching and scheduling via a single metric: the smallest elapsed execution time of a representative workload
- Complete independence from the hardware performance
- Automatic adaptation to cluster sizes
- Parametrized Gaussian model of submission
- Particular number of jobs derived from different classes with various characteristics
Specific parameters
- No real computation (sleep jobs)
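The exact submission parameters of ESP are not given on this slide; the sketch below only illustrates the idea of injecting sleep jobs with Gaussian-distributed interarrival times through sbatch. The class definitions and timing constants are hypothetical.

```python
#!/usr/bin/env python3
"""Illustrative workload injector: sleep jobs submitted through sbatch with
Gaussian-distributed interarrival times (the ESP submission model is Gaussian;
the constants below are made up for the example)."""
import random
import subprocess
import time

MEAN_GAP_S = 2.0      # hypothetical mean interarrival time
STDDEV_S = 0.5        # hypothetical standard deviation
JOB_CLASSES = [       # (cores, run time in seconds) - illustrative classes
    (8, 22),
    (16, 27),
    (64, 45),
]

def submit(cores, runtime):
    # A sleep job: no real computation, only scheduler/launcher load.
    cmd = ["sbatch", "-n", str(cores), "--wrap", "sleep %d" % runtime]
    subprocess.run(cmd, check=True)

def inject(count):
    for _ in range(count):
        cores, runtime = random.choice(JOB_CLASSES)
        submit(cores, runtime)
        time.sleep(max(0.0, random.gauss(MEAN_GAP_S, STDDEV_S)))

if __name__ == "__main__":
    inject(230)
```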

Proposed models for workload injection
Light-ESP model
- Shorter execution times than ESP
- 230 jobs and 14 classes of jobs
- Largest jobs use 50% of the machine (ignoring Z jobs)
- Smallest jobs use 3.125% of the machine
Parallel Light-ESPx10 model
- Shorter execution times than ESP
- 10x smaller job sizes
- No Z jobs (13 classes)
- 10x more jobs (2280 parallel job launches)
- Largest jobs use 5% of the machine
- Smallest jobs use 0.3% of the machine

Proposed models for workload injection - Benchmarks
Each cell gives the fraction of the job size relative to the system size (job size in cores for a cluster of 80640 cores) / number of jobs / run time (sec).

| Job type | Normal-ESP | Light-ESP | Parallel Light-ESPx10 |
| A | 0.03125 (2520) / 75 / 267s | 0.03125 (2520) / 75 / 22s | 0.003125 (252) / 750 / 22s |
| B | 0.06250 (5040) / 9 / 322s | 0.06250 (5040) / 9 / 27s | 0.00625 (504) / 90 / 27s |
| C | 0.50000 (40320) / 3 / 534s | 0.50000 (40320) / 3 / 45s | 0.05000 (4032) / 30 / 45s |
| D | 0.25000 (20160) / 3 / 616s | 0.25000 (20160) / 3 / 51s | 0.02500 (2016) / 30 / 51s |
| E | 0.50000 (40320) / 3 / 315s | 0.50000 (40320) / 3 / 26s | 0.05000 (4032) / 30 / 26s |
| F | 0.06250 (5040) / 9 / 1846s | 0.06250 (5040) / 9 / 154s | 0.00625 (504) / 90 / 154s |
| G | 0.12500 (10080) / 6 / 1334s | 0.12500 (10080) / 6 / 111s | 0.01250 (1008) / 60 / 111s |
| H | 0.15820 (12757) / 6 / 1067s | 0.15820 (12757) / 6 / 89s | 0.01582 (1276) / 60 / 89s |
| I | 0.03125 (2520) / 24 / 1432s | 0.03125 (2520) / 24 / 119s | 0.003125 (252) / 240 / 119s |
| J | 0.06250 (5040) / 24 / 725s | 0.06250 (5040) / 24 / 60s | 0.00625 (504) / 240 / 60s |
| K | 0.09570 (7717) / 15 / 487s | 0.09570 (7717) / 15 / 41s | 0.00957 (772) / 150 / 41s |
| L | 0.12500 (10080) / 36 / 366s | 0.12500 (10080) / 36 / 30s | 0.01250 (1008) / 360 / 30s |
| M | 0.25000 (20160) / 15 / 187s | 0.25000 (20160) / 15 / 15s | 0.02500 (2016) / 150 / 15s |
| Z | 1.00000 (80640) / 2 / 100s | 1.00000 (80640) / 2 / 20s | 1.00000 (80640) / 2 / 20s |
| Total jobs / theoretic run time | 230 / 10773s | 230 / 935s | 2282 / 935s |
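Since ESP adapts job sizes automatically to the cluster size, a convenient way to drive an injector like the one sketched earlier is to encode each class as a fraction of the machine and resolve it to cores at run time. The numbers below are the Light-ESP column of the table; the helper itself is only an illustration.

```python
# Light-ESP classes from the table above: (class, fraction of system, job count, run time in seconds).
LIGHT_ESP = [
    ("A", 0.03125, 75, 22), ("B", 0.06250, 9, 27), ("C", 0.50000, 3, 45),
    ("D", 0.25000, 3, 51), ("E", 0.50000, 3, 26), ("F", 0.06250, 9, 154),
    ("G", 0.12500, 6, 111), ("H", 0.15820, 6, 89), ("I", 0.03125, 24, 119),
    ("J", 0.06250, 24, 60), ("K", 0.09570, 15, 41), ("L", 0.12500, 36, 30),
    ("M", 0.25000, 15, 15), ("Z", 1.00000, 2, 20),
]

def resolve(total_cores):
    """Map each class fraction to a concrete job size for a given cluster."""
    return [(name, max(1, round(frac * total_cores)), count, runtime)
            for name, frac, count, runtime in LIGHT_ESP]

if __name__ == "__main__":
    # For an 80640-core machine, class A resolves to 2520-core jobs,
    # matching the table; the total job count stays 230.
    classes = resolve(80640)
    print(classes[0])                      # ('A', 2520, 75, 22)
    print(sum(c[2] for c in classes))      # 230
```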


Jobs high throughput
Large workload management
- Large number of directly eligible jobs (see the burst sketch below)
  - 10k jobs from 1 to 8 cores to fill an idle machine
- Large number of pending job launches
  - 1 big allocation to ensure that no new jobs can be scheduled
  - 10k jobs from 1 to 8 cores to fill an idle machine
- Large number of job terminations
  - 10k jobs from 1 to 8 cores to fill an idle machine
  - Global cancellation when every job is running
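A minimal sketch of the kind of submission-burst script described here: it floods the controller with 10k sleep jobs of random size between 1 and 8 cores, and a helper cancels everything once the burst is running. Sleep jobs follow the "no real computation" rule of the benchmarks; everything else (partition, account, pacing) is omitted, and the parameters are illustrative.

```python
#!/usr/bin/env python3
"""Submission-burst sketch: 10k sleep jobs of 1-8 random cores, plus a helper
to cancel all of the current user's jobs (illustrative only)."""
import getpass
import random
import subprocess

def submit_burst(n_jobs=10000, max_cores=8, sleep_s=600):
    for _ in range(n_jobs):
        cores = random.randint(1, max_cores)
        subprocess.run(
            ["sbatch", "-n", str(cores), "--wrap", "sleep %d" % sleep_s],
            check=True, stdout=subprocess.DEVNULL,
        )

def cancel_all():
    # Global cancellation of the current user's jobs, used to stress
    # the termination path once every job is running.
    subprocess.run(["scancel", "-u", getpass.getuser()], check=True)

if __name__ == "__main__":
    submit_burst()
```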

Large workload management
High throughput of directly eligible job launches
- Comparison of 2 SLURM configurations
  - default mode: 1 schedule attempt per incoming job
  - defer mode: 1 schedule attempt every 60 seconds (configurable)

[Figures: instant throughput for 10000 submitted jobs (random 1-8 cores) upon a 2020-node (32 cpu/node) cluster, without and with the defer strategy]

Observations
- defer mode greatly increases the submission rate
- but it induces latency between submission and execution
- not a good choice in the case of single interactive jobs
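One way to reproduce this kind of instant-throughput plot (not necessarily the authors' tooling) is to bin job start timestamps per second from the accounting database, for example:

```python
#!/usr/bin/env python3
"""Bin job start times per second from sacct output to approximate an
'instant throughput' curve (illustrative; not the authors' exact tooling)."""
import collections
import subprocess
from datetime import datetime

def start_times():
    # -X: one line per job allocation, -P: parsable output, -n: no header.
    out = subprocess.run(
        ["sacct", "-X", "-P", "-n", "-o", "JobID,Start"],
        check=True, capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        _, start = line.split("|", 1)
        if start not in ("Unknown", "None", ""):
            yield datetime.strptime(start, "%Y-%m-%dT%H:%M:%S")

def throughput_per_second():
    counts = collections.Counter(start_times())
    return sorted(counts.items())

if __name__ == "__main__":
    for second, count in throughput_per_second():
        print(second.isoformat(), count)
```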

Large workload management
Large number of pending job submissions
- Heavy load induced on the controller
  - Every job submission triggers the scheduler logic, even in defer mode
  - A patch removing this scheduling logic greatly improves the submission rate (landed in the SLURM main branch)
- Lower rate than for eligible jobs
  - No longer the case, corrected in the main branch

[Figures: instant throughput for 10000 submitted jobs (random 1-8 cores) in waiting state upon a 2020-node (32 cpu/node) cluster, before and after the defer optimization]

Large workload management
Large number of job terminations
- Heavy load due to a large number of completion messages involving a complex logic in SLURM internals
  - 20 minutes of unresponsiveness of the system
- Patch from the community to reduce the complexity
  - greatly improves the termination rate
  - no major unresponsiveness after applying the patch

[Figures: instant throughput for 10000 terminating jobs (random 1-8 cores) upon a 2020-node (32 cpu/node) cluster, before and after the backfill optimization]

Network topology aware placement
SLURM topology-aware plugin
- Supports network tree topology awareness for jobs
- Best-fit approach: packs the jobs on the minimal number of switches, maximizing the amount of remaining free switches (see the sketch below)
- A file describes the network design topology
Curie cluster network design
- Current Infiniband technology is based on a 36-port ASIC, so fat-tree topologies are constructed around a pattern of 18 entities.
- Actual design: 280 Infiniband leaves aggregated by 18 324-port physical switches, grouping the 5040 nodes in 16 virtual intermediate switches.
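To make the best-fit idea concrete, here is a toy version of the selection step (this is not SLURM's actual topology/tree code): if a single leaf switch can hold the job, take the fullest one that fits, so large switches stay free; otherwise span as few switches as possible by drawing from the emptiest ones first.

```python
"""Toy best-fit placement over leaf switches (illustrative, not SLURM's code).
Each switch is (name, free_cores); a job asks for a number of cores."""

def best_fit(switches, requested_cores):
    # Single-switch case: among the switches that fit the job, take the one
    # that would be left with the least free space (best fit).
    fitting = [s for s in switches if s[1] >= requested_cores]
    if fitting:
        return [min(fitting, key=lambda s: s[1])[0]]

    # Multi-switch case: take the switches with the most free cores first,
    # which keeps the number of switches spanned by the job minimal.
    remaining = requested_cores
    chosen = []
    for name, free in sorted(switches, key=lambda s: -s[1]):
        if remaining <= 0:
            break
        if free == 0:
            continue
        chosen.append(name)
        remaining -= free
    return chosen if remaining <= 0 else None   # None: not enough free cores

if __name__ == "__main__":
    leaves = [("sw0", 4), ("sw1", 16), ("sw2", 10), ("sw3", 0)]
    print(best_fit(leaves, 8))    # ['sw2']: tightest single-switch fit
    print(best_fit(leaves, 24))   # ['sw1', 'sw2']: fewest switches spanned
```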

Network topology aware placement
Emulation of Curie using 200 Tera-100 physical nodes
- 5040 nodes with an Infiniband fat tree (18-ary fat tree)
- The real topology of the emulated cluster differs from the physical nodes' topology. This is not an issue since the goal is to evaluate the behavior of SLURM scheduling, not the behavior of the workload execution.
10 runs of the Light-ESP benchmark for 4 scenarios (an illustrative topology description generator follows below)
- No topology
- Basic topology: a single virtual switch
- Medium topology: intermediate level only (324-node switches only)
- Fine topology: all levels (324-node switches, 18-node lineboards)
Comparison of placement versus optimal placement
- Optimal numbers of intermediate- and leaf-level switches
- Evaluate how often insufficiently detailed descriptions still happen to provide a good solution by chance
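The "fine" scenario corresponds to a topology.conf that lists both the lineboard-level leaf switches and the intermediate switches that aggregate them. The sketch below generates such a file for the emulated 5040-node cluster; the node and switch names and the exact grouping are illustrative, not the real Curie description.

```python
#!/usr/bin/env python3
"""Generate an illustrative topology.conf for a 5040-node emulated cluster:
280 leaf switches of 18 nodes each, aggregated by 16 intermediate switches
(switch/node names and the exact grouping are made up for the example)."""

NODES = 5040
NODES_PER_LEAF = 18                              # lineboard-level leaf switch
LEAVES = NODES // NODES_PER_LEAF                 # 280
INTERMEDIATE = 16
LEAVES_PER_INTERMEDIATE = -(-LEAVES // INTERMEDIATE)   # ceiling -> 18 leaves each

def generate(out):
    for leaf in range(LEAVES):
        first = leaf * NODES_PER_LEAF + 1
        last = first + NODES_PER_LEAF - 1
        out.write("SwitchName=leaf%03d Nodes=vn[%04d-%04d]\n" % (leaf, first, last))
    for mid in range(INTERMEDIATE):
        first = mid * LEAVES_PER_INTERMEDIATE
        last = min(first + LEAVES_PER_INTERMEDIATE, LEAVES) - 1
        out.write("SwitchName=mid%02d Switches=leaf[%03d-%03d]\n" % (mid, first, last))

if __name__ == "__main__":
    import sys
    generate(sys.stdout)
```

The "medium" description would keep only the intermediate-level SwitchName lines, and the "basic" one a single switch covering all nodes.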

Network topology placement

[Figures: (1) optimal placement respecting topological constraints for the 230 jobs of the Light-ESP benchmark upon a 5040-node (16 cpu/node) cluster with SLURM and the different topology placement strategies, with the total execution time of Light-ESP per strategy; (2) CDF of waiting time for the Light-ESP benchmark (230 jobs) upon a 5040-node (16 cpu/node) cluster with variation of the topology strategies: no topology, basic, medium, fine]

Topology effectiveness
- Fine topology is not efficient in providing the optimal number of intermediate-level switches
  - because the best-fit selection of switches is only made at leaf level when the system is fragmented
- A pruned topology, like Tera-100, should not use such a fine description
- A full fat tree can use it, as it still gives very good results at leaf level

Nodes scalability
Scalability evaluation (standard jobs)
- 400 physical nodes of Tera-100
- 10 runs of the Light-ESP benchmark for 6 scenarios
  - 1024, 2048, 4096, 8192, 12288 and 16384 nodes
- Global execution time measurement
  - Reflects the internal scheduler logic scalability when the number of resources increases at the same pace as job sizes
Scalability evaluation (small jobs)
- 200 physical nodes of Tera-100 (same results as with 400)
- Parallel Light-ESPx10 benchmark for 6 scenarios
  - 1024, 2048, 4096, 8192, 12288 and 16384 nodes
  - Jobs are ten times smaller (in size) than in the previous evaluation
- Global execution time measurement
  - Reflects the internal scheduler logic scalability when the number of resources increases, but with a workload composed of a larger number of small jobs

Resources scalability for standard jobs

[Figures: system utilization for the Light-ESP synthetic workload of 230 jobs with SLURM upon 1024-, 4096-, 8192- and 16384-node (16 cpu/node) clusters, emulated upon 400 physical nodes; each panel shows system utilization in cores and job start time impulses over time]

Resources scalability for standard jobs

[Figures: same system utilization plots as on the previous slide (1024 to 16384 emulated nodes)]

- Scalability is great up to a threshold between 4096 and 8192 nodes
- The controller is heavily loaded at the end of large jobs
  - Every node involved in a job sends its own completion message
  - Too much time is spent handling that load on the controller
  - Visible in the stretching of the diagram
- An aggregation strategy must be studied for better efficiency

Resources scalability for small jobs

[Figures: system utilization for the Light-ESPx10 synthetic workload of 2280 jobs with SLURM upon 1024-, 4096-, 8192- and 16384-node (16 cpus/node) clusters, emulated upon 200 physical nodes; each panel shows system utilization in cores and job start time impulses over time]

Resources scalability for small jobs

[Figures: same system utilization plots as on the previous slide (1024 to 16384 emulated nodes)]

- Scalability is great up to a threshold between 4096 and 8192 nodes
- Very high efficiency for clusters up to 4096 nodes
  - Visible in the tight packing of the first 2 diagrams
- Good responsiveness of the system, but system utilization collapses for large clusters
- Further investigations required; probably a smoothed version of the previous issue (small jobs are no longer so small at that scale)


Conclusions - Future Works
Experimental methodology
- Emulation of large clusters on currently existing systems is really interesting
- Automation of the emulation in production jobs
  - Good for evaluating the behavior of the RJMS internals at scale
  - Great to avoid depending on a particular set of physical nodes as the back-end
  - Eases reproduction of identical benchmarks to guarantee the results
  - Avoids reconfiguring everything when a single node has a failure
- More workloads, synthetic and real, should be tested
  - Evaluate more scenarios and detect issues in order to correct them

Conclusions - Future Works
Evaluation results and observations
- Large workload management
  - Good results up to 10k jobs with minor modifications (all included in the main branch now)
- Topology effectiveness
  - Good results, but gains are possible with an enhanced management of the intermediate levels
  - Needs a multi-level best-fit selection logic
- Nodes scalability
  - Scalability threshold between 4096 and 8192 nodes (depends on the workload)
  - Improvements in completion message handling seem mandatory
    - Large impact on performance for large clusters
    - One of the few n-to-1 communication patterns in SLURM
  - More tests on large workloads and large clusters are required
