Adaptively Scheduling Processes on a Simultaneous Multithreading Processor


Omer Zaki, Matthew McCormick, Jonathan Ledlie
{ozaki, mattmcc, ledlie}@cs.wisc.edu
Computer Sciences Department
1210 West Dayton St.
Madison, WI 53706, U.S.A.

Abstract

Within recent years the concept of the simultaneous multithreading (SMT) processor has been gaining in popularity. This hardware allows multiple processes to run on the processor at the same time, providing more potential for instruction-level parallelism. These new processors suggest that the rules an operating system (OS) scheduler follows need to be changed, or at least modified. Our study shows that the combination of jobs selected to run on these threads can significantly affect system performance. Our research shows that scheduling policies are greatly affected by the system workload and that there most likely does not exist a single best scheduling policy. However, a scheduler that tries to schedule processes doing a large number of loads and stores together with jobs doing few loads and stores consistently performs at levels close to or better than all other scheduling policies examined. It can also be seen that the more possibilities there are for scheduling, the more necessary it is to have an intelligent scheduler. In contrast, the fewer decisions there are to make (few threads and/or few processes), the less important a scheduler's decision becomes.

1 Introduction

Simultaneous Multithreading (SMT) processors enable more than one process to have instructions in flight at the same instant [Tullsen96]. Given that the maximum number of processes which can be run is built in to the hardware's implementation, this begs the question: what policy should be used to pick the processes to run together? Because different processes have different resource requirements, intuition would suggest that if we can find processes that are not using the same resources, they will conflict the least, they will be able to get the most instructions through the pipeline, and, as a consequence, we will achieve higher overall instructions per cycle. We show that if a mixture of resource-using jobs is available and if we minimize the

structural conflicts through scheduling, we obtain a small but consistent gain in overall IPC over a less intelligent policy.

The essential concept behind SMT is that only so much parallelism can be extracted from a given process but, if more processes contribute their instructions to the pool from which the processor can choose, more parallelism and, therefore, more throughput, will result. Consider some process A whose next instructions depend on a long-latency operation, such as a load that misses in the cache: in a single-threaded processor the pipeline stalls behind it. In an SMT, other processes' instructions can flow around this stalled one. It is much less likely that interprocess dependencies exist between some process B and A than the intraprocess dependency just witnessed.

Two or more processes can conflict with one another if they are attempting to perform the same kind of instruction at the same time. For example, many concurrent loads and stores or ALU operations can clog these particular parts of the processor, whether they come from one process or several. This project seeks to ameliorate these structural bottlenecks by running processes which have different resource demands at the same time.

Before proceeding into how this scheduling is distinct from traditional operating systems scheduling, let us establish our nomenclature. For the purposes of this paper, a process or job is a software construct with long-term state; the distinction between lightweight software threads and processes is ignored. Here, threads refer to hardware contexts, of which a processor has a fixed number (1, 2, and 4 in our tests). Processes run on threads for a short period called a quantum. While each process is running, hardware counters track each process's resource usage. After its quantum expires, software retains this state. Additionally, our schedulers are able to record and associate hardware state like floating point usage with a particular process. This is further explained in Sections 3 and 4.
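To make this bookkeeping concrete, here is a minimal sketch in C of the per-process record such a scheduler might retain. The type, enum, and function names are our own illustration, not the simulator's actual structures.

```c
#include <stdint.h>

/* Hypothetical resource categories a hardware counter might track. */
enum resource {
    RES_LOAD, RES_STORE, RES_INT_ALU, RES_FP_UNIT, RES_CACHE_MISS,
    RES_COUNT
};

/* Per-process state retained by the scheduler. The hardware exposes
 * per-thread counters; when a process's quantum expires, software
 * copies that thread's counters into the process's record. */
struct process {
    int      pid;
    uint64_t usage[RES_COUNT];   /* counts from the last quantum */
};

/* Called when a process's quantum expires. */
void save_usage(struct process *p, const uint64_t hw_counters[RES_COUNT])
{
    for (int r = 0; r < RES_COUNT; r++)
        p->usage[r] = hw_counters[r];
}
```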

Operating systems traditionally handle process scheduling by concentrating on the amount of time a process is allowed to stay on a processor. The system keeps a history of the job's behavior and tries to give it a quantum just long enough to accomplish what it needs to and no more. Generally, it does not track what kinds of instructions the process is executing, only how long it demands to use the processor. For example, a job which has, in its recent past, only needed the processor to set up a new I/O request, after which it relinquishes the processor, is given higher priority than a seldom-yielding, computationally intensive process.

The concept is simple: because these processes can quickly set up their work and then go to sleep, waiting for the next keystroke or disk interrupt, for example, they should be given higher priority because their work can be done in parallel with a processor-intensive job. This shortest-job-first policy leads to higher throughput than a schedule like round robin. Operating system schedulers generally also include measures to allow for other means of prioritization and to prevent starvation. For a traditional scheduler, the foci are when to remove a process and how to allow short jobs with small quanta to go first.
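As a rough illustration of this time-focused approach, consider a multilevel-feedback-style priority update; this is our own simplification, not the code of any particular Unix variant.

```c
/* Simplified multilevel-feedback-style priority update (illustrative).
 * Lower numbers mean higher priority. A job that blocks (e.g., for I/O)
 * before using its whole quantum is promoted; a job that consumes its
 * entire quantum is demoted, so I/O-bound jobs tend to run first. */
#define NQUEUES 8

int update_priority(int prio, int used_ticks, int quantum_ticks)
{
    if (used_ticks < quantum_ticks) {
        if (prio > 0) prio--;               /* yielded early: promote */
    } else {
        if (prio < NQUEUES - 1) prio++;     /* used full quantum: demote */
    }
    return prio;
}
```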

On an SMT, the problem changes considerably. Here, an OS must not only determine which processes should run first and for how long, but also which run best together. In our study we have ignored the former problem and instead focused exclusively on the latter. We have developed and evaluated three new ways of scheduling on an SMT and compared them with each other and with a random and a round-robin scheduler.

1.1 A Simple Example

To form a more concrete image in the reader's mind, we would like to go through a simple example. Imagine that there is some CPU resource whose usage can be enumerated and compared to that of other processes. Examples include any resource that could potentially become a bottleneck: the result bus, the load and store queues, any or all of the integer or floating point units, the register file. Because we are not varying the amount of time a process lives on the CPU, each thread is relieved of its process in round-robin fashion. Suppose that thread 1 has reached the end of its quantum. The other three threads will continue to run their resource-intensive (20 units each) processes. The advantage of switching only one process at a time is that it lessens the deleterious impact of saving the process's registers and squashing its in-flight instructions [Tullsen00]. Thread 1 is the victim, and we must choose from the waiting processes, which have consumed 9, 10, 8, or 20 units of the resource during their last quanta.
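To make the selection arithmetic explicit, a small illustrative snippet; the values are just those of this example.

```c
#include <stdio.h>

int main(void)
{
    /* Resource usage, last quantum, of the three processes that stay. */
    int staying = 20 + 20 + 20;
    /* Candidates on the ready queue. */
    int ready[] = { 9, 10, 8, 20 };
    int n = sizeof ready / sizeof ready[0];

    for (int i = 0; i < n; i++)
        printf("picking %2d -> total usage %d\n", ready[i], staying + ready[i]);
    /* Prints totals 69, 70, 68, and 80: choosing 8 keeps the metered
     * resource least contended, while choosing 20 yields the total of
     * 80 discussed below. */
    return 0;
}
```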

If we choose the process with 20 from the ready queue (for a total of 80 for the runners), the resource we are metering may become a structural bottleneck and stem the flow of instructions through the pipeline. We argue that the best process to mix with the resource-intensive processes is 8, because this will keep the usage of the resource as low as possible while the heavy users are running. Ideally the system would soon achieve an equilibrium where two 20s and two from { 8, 9, 10, 10 } alternate, minimizing the resource as a potential bottleneck while preventing starvation. The schedulers which we developed and describe in Section 3.2 achieve this alternating equilibrium.

Section 2 discusses previous work on SMT scheduling and how the schedulers we present are more flexible. Section 3 examines the schedulers themselves and how we came to choose them from other resource possibilities. Section 4 describes how we extended an existing SMT simulator to include these scheduling policies. Section 5 shows the schedulers' performance on combinations of SPEC 2000 benchmarks. Section 6 looks at future work to build on the performance results. The last section draws some conclusions about the SMT schedulers we have developed.

2 Related Work

Scheduling jobs on an SMT processor has only recently received attention, while the SMT model has been studied for well over five years.

In a single-threaded architecture the objective is to have a single job running on the processor at all times. When a running job has to wait or is made to wait, a ready job is allocated the processor. Work in this area has focused on overlapping waits with computation, thereby attempting to utilize the processor to its fullest extent. Scheduling jobs on a single-threaded processor has been well researched, and several

successful schemes have been proposed. Most schemes attempt to increase throughput by overlapping the I/O of one job with the calculations of others. For instance, the scheduling policy chosen for several flavors of Unix (4.3 BSD Unix [Thompson74], Unix System V, and the Solaris TS (timesharing) scheduling class) is multi-level feedback, which encourages I/O-bound jobs to run more frequently, thus leading to higher overall machine utilization.

In the case of an SMT processor such a policy would not be sufficient, since several jobs can be coscheduled on the processor. The jobs executing together on the processor compete for hardware resources, making the choice of jobs to coschedule an important one, affecting processor utilization. Thus, the full benefits of SMT hardware can only be realized if the scheduler is aware of the interactions between the coscheduled jobs [Tullsen00]. Scheduling jobs on an SMT thus requires finer-grained scheduling, efficiently allocating processor hardware resources among the coscheduled jobs.

We are aware of only one other work that has studied scheduling for an SMT. Tullsen et al. [Tullsen00] propose a scheme named SOS that schedules jobs in a two-step process. The first step, called the sampling phase, collects information on various possible schedules, while the second step, called the symbiosis phase, uses the sampled information to predict the schedule that would provide the best performance and runs it for a certain number of cycles until the next sampling phase is triggered. The sampling phase runs for a small number of cycles when compared to the symbiosis phase. Success of this scheme thus largely depends on the quality of the schedules chosen in the sampling phase. To reduce the overhead of sampling, only a small number of schedules can be considered from the huge space of possible schedules. Information is gathered for each chosen schedule by running the jobs accordingly.
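Schematically, the two-step structure reads as follows. This is only a sketch of the scheme as summarized above, not Tullsen et al.'s actual code; the schedule count, phase lengths, and helper functions are placeholders.

```c
/* Schematic of SOS-style scheduling: a short sampling phase tries a few
 * candidate schedules, then the predicted-best one runs for a long
 * symbiosis phase. All constants and helpers here are placeholders. */
#define N_SAMPLED_SCHEDULES 8        /* small subset of the schedule space */
#define SAMPLE_CYCLES       10000L   /* short */
#define SYMBIOSIS_CYCLES    1000000L /* long, relative to sampling */

extern void   run_schedule(int schedule, long cycles);
extern double observed_performance(int schedule);

void sos_loop(void)
{
    for (;;) {
        /* Sampling phase: run each candidate briefly and record results. */
        int best = 0;
        double best_perf = -1.0;
        for (int s = 0; s < N_SAMPLED_SCHEDULES; s++) {
            run_schedule(s, SAMPLE_CYCLES);
            double perf = observed_performance(s);
            if (perf > best_perf) { best_perf = perf; best = s; }
        }
        /* Symbiosis phase: commit to the predicted-best schedule. */
        run_schedule(best, SYMBIOSIS_CYCLES);
    }
}
```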

2.1 Motivation

The SOS work demonstrates that the performance of an SMT processor is sensitive to the set of jobs that are coscheduled by the operating system job scheduler. This forms the basis of our project. There are, however, two major limitations we identify in SOS: 1) the strict two-step decision-making process can potentially make scheduling less responsive to job workload changes, and 2) choosing a good schedule from the large space of possibilities can be unwieldy in realistic scenarios. We propose a different scheme that chooses jobs to run on an SMT processor based on information collected on each of the ready jobs as well as information collected on the executing jobs. Although, like the SOS scheme, ours involves collecting information on the jobs, we believe it is more adaptive to workload changes than SOS because it bases its scheduling decisions on the most current state of the jobs, rather than on a strict two-step scheme.

3 Adaptive Scheduling

There exists a huge space of potential scheduling policies from which to choose. As a result, a good portion of our project involved trying to come up with possible solutions to the coscheduling problem. The initial phase involved writing several micro-benchmarks designed to test very specific process interactions. For example, one micro-benchmark attempted to exploit spatial locality by accessing a large array in column-major order. To complement this test, another program was written to access the same array in row-major order. This gave us two programs with very different usage characteristics of the same resource.
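Such a pair of traversal micro-benchmarks might look like the following in C (a sketch; the array size is arbitrary, chosen only to exceed the caches). Note that with C's row-major layout, the row-order loop is the one that walks memory sequentially.

```c
#define N 1024
static double a[N][N];   /* 8 MB: larger than the simulated L2 */

/* Row-major traversal: consecutive accesses fall in the same cache
 * lines, so spatial locality is high. */
double sum_row_major(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same array: each access strides by a
 * full row, touching a new cache line almost every time. */
double sum_col_major(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```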

Our other micro-benchmarks included programs to test temporal locality, branch prediction capabilities, and the number of loads and stores. Floating point and integer versions of all of these programs were constructed. At this point, we randomly tested combinations of these benchmarks with each other. It was the results of these micro tests that led us to select our scheduling policies. The policies we decided to implement were a memory usage scheduler, a cache miss scheduler, and a functional unit usage scheduler. We also implemented two simple, uninformed schedulers, round robin and random, to use as a baseline for our results.

Before discussing the policies, it is worthwhile to mention what additions to the hardware would be necessary to support these scheduling policies. First of all, the processor must be capable of context switching. Should all four processes be stopped? Should two? Or only one? In what order should they be preempted? Our scheduler assumes that only one thread will be stopped at a time and that the selection of threads to stop is done in a purely round-robin fashion. Secondly, resource counters are needed for each thread. In this way, whenever a process running on a thread uses a resource, like a floating point adder or a memory read port, the resource counter for that thread is incremented by one. This data can then be retrieved by the operating system whenever it desires the information.
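Combining the two additions, a minimal sketch of the preemption step; the counter-reading interface is hypothetical.

```c
#define NTHREADS 4

/* Hypothetical hardware interface: per-thread resource counters. */
extern unsigned long read_hw_counter(int thread, int resource);
extern void          reset_hw_counters(int thread);

static int next_victim = 0;

/* At the end of each quantum exactly one thread is preempted, chosen
 * round robin. Its counters are harvested into OS-side state for the
 * outgoing process, then cleared for the incoming one. */
int preempt_one_thread(unsigned long usage_out[], int nresources)
{
    int victim = next_victim;
    next_victim = (next_victim + 1) % NTHREADS;

    for (int r = 0; r < nresources; r++)
        usage_out[r] = read_hw_counter(victim, r);
    reset_hw_counters(victim);
    return victim;   /* the hardware thread that receives the next process */
}
```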

The first scheduler, MEM, attempts to schedule jobs doing a lot of loads and stores with jobs doing few loads and stores. Along the same vein is the MISS scheduler, which tries to schedule jobs with a high cache miss rate with jobs achieving a low cache miss rate. The implementation of these two schedulers is almost identical. Both calculate an average over all processes in the entire system, based on the resource being examined: MEM averages memory accesses and MISS averages cache misses. Each waiting job in its turn is then averaged in with the job(s) that are to remain running on the processor. The process most recently removed from the processor is not considered during this step. The waiting job that, if it were run, would give the running processes an average closest to the system average is the one chosen to run.
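A sketch of this selection, assuming the per-job metric (memory accesses for MEM, cache misses for MISS) has already been gathered; all names are ours.

```c
#include <math.h>

/* One score per job: loads+stores for MEM, cache misses for MISS. */
struct job { int pid; double metric; int just_preempted; };

/* Choose the waiting job that, if co-scheduled with the jobs staying
 * on the processor, brings the runners' average metric closest to the
 * system-wide average over all jobs. */
int pick_job(const struct job *running, int nrun,   /* jobs staying on  */
             const struct job *waiting, int nwait,  /* ready queue      */
             double system_avg)                     /* avg over all jobs */
{
    double run_sum = 0.0;
    for (int i = 0; i < nrun; i++)
        run_sum += running[i].metric;

    int best = -1;
    double best_dist = INFINITY;
    for (int i = 0; i < nwait; i++) {
        if (waiting[i].just_preempted)      /* skip the job just removed */
            continue;
        double avg  = (run_sum + waiting[i].metric) / (nrun + 1);
        double dist = fabs(avg - system_avg);
        if (dist < best_dist) { best_dist = dist; best = i; }
    }
    return best;   /* index into waiting[], or -1 if no candidate */
}
```

The FU scheduler described next fits the same loop with a different score: instead of distance to a system average, it computes the standard deviation of the projected per-functional-unit usage and picks the candidate that minimizes it.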

The functional unit scheduler, FU, is based on the premise that the best scheduling policy will provide the best possible balance across all of the functional units. The FU scheduler works by going through each of the waiting jobs, assuming it would run with the job(s) remaining on the processor, and calculating the standard deviation of usage across all of the functional units. Again, the job most recently removed from the processor is not figured into these calculations. The waiting process that yields the lowest standard deviation is the one selected to run.

4 Methodology

We evaluate our adaptive scheme by carrying out a series of experiments using benchmarks from the SPEC 2000 suite on an extended version of Zilles' SMT simulator [Zilles99].

4.1 Simulation infrastructure

To evaluate our adaptive scheme, we extended Zilles' SMT simulator [Zilles99]. Prior to our extension, the simulator did not have support for scheduling more jobs than hardware threads. Zilles' simulator models an SMT processor similar to the one proposed in [Tullsen96], along with several enhancements. We configured the core of the processor to have 1, 2, or 4 threads. Instructions are scheduled out of order, and all threads share a single fetch unit, branch predictor, decoder, centralized instruction window (128 entries), scheduler, memory system, and pool of functional units. The memory system comprises separate 64KB L1 data and instruction caches and a 1MB unified L2 cache. The functional units are 3 integer units, 6 floating point units, 1 load unit, and 1 store unit. The processor is capable of fetching 4 instructions per cycle and has 5 pipe stages between fetch and decode and 7 pipe stages between decode and execute (for a total 12-cycle mispredict penalty). Other specifications of the simulated machine are described in [Zilles99].
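For reference, the stated configuration can be summarized as follows; the field names are our own, and the values are those given above.

```c
/* Simulated machine parameters as stated in the text (names ours). */
struct smt_config {
    int threads;            /* 1, 2, or 4 hardware contexts        */
    int fetch_width;        /* 4 instructions per cycle            */
    int window_entries;     /* 128-entry instruction window        */
    int l1_dcache_kb;       /* 64 KB L1 data cache                 */
    int l1_icache_kb;       /* 64 KB L1 instruction cache          */
    int l2_unified_mb;      /* 1 MB unified L2                     */
    int int_units;          /* 3 integer units                     */
    int fp_units;           /* 6 floating point units              */
    int load_units;         /* 1 load unit                         */
    int store_units;        /* 1 store unit                        */
    int fetch_to_decode;    /* 5 pipe stages                       */
    int decode_to_execute;  /* 7 pipe stages (12-cycle mispredict) */
};

static const struct smt_config cfg = {
    .threads = 4, .fetch_width = 4, .window_entries = 128,
    .l1_dcache_kb = 64, .l1_icache_kb = 64, .l2_unified_mb = 1,
    .int_units = 3, .fp_units = 6, .load_units = 1, .store_units = 1,
    .fetch_to_decode = 5, .decode_to_execute = 7,
};
```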

In extending Zilles' simulator to support scheduling, we added:

- A context switch mechanism: a desired job running on a given hardware thread can now be interrupted and replaced.
- A scheduler mechanism: a higher number of jobs than hardware threads can now be run on the simulated processor. The jobs get to use the processor based on policies that can be specified.
- A statistics gathering mechanism: to simulate hardware counters as in a real processor, we added support to keep track of the resource consumption of every job while it uses the processor. Our scheduling schemes rely on these gathered statistics to make well-informed decisions.

4.2 Benchmarks

Ten benchmarks were selected from the SPEC CPU2000 suite [SPEC CPU 2000]. We report data from six of them in Table 1. Three of these benchmarks are integer intensive and the remaining are floating point intensive. Our primary criterion for choosing these benchmarks was their run-time behavior. Crafty, for instance, heavily stressed the integer functional units with a significant number of logical operations such as and, or, exclusive or, and shift, while going easy on the store unit. Table 2 gives the functional unit usage for two sample INT and FP benchmarks.

For each of these benchmarks we use the reference input set and generate EIO [Burger97] traces of 100 million instructions (on a Linux system). To take into account the initialization phase of each benchmark, we skip the first 1 million instructions.

In addition to these established benchmarks, we created micro-benchmarks to decide upon a set of heuristics to use in making job scheduling choices.

[Table 1 columns: Name, IPC, Loads (million), Stores (million), L miss (million), S miss (million), Branches (million), Branch hits (million), Hit ratio. The row data is garbled in this transcription.]

Table 1: Characteristics of some benchmarks. The data was collected for approx. 100 million instructions. The ones shaded are INT and the others are FP.

FU type            Crafty         Mgrid
Null                  114          1390
int ALU        73,344,019    21,892,622
int multiply      339,519       163,574
int divide              0             0
FP add/sub            115    56,655,872
FP comparison           0             0
FP conversion      36,199             0
FP multiply             2    58,388,608
FP divide               2             0
FP sqrt                 0             0
rd port        32,063,507    43,869,561
wr port         7,298,864     8,650,004

Table 2: Functional unit usage of two benchmarks. The values represent the number of instructions that used the given unit. The data was collected for approx. 100 million instructions.

5 Results

To test the various scheduling algorithms, we chose 6 benchmarks: 3 integer intensive and 3 floating point intensive. Some relevant characteristics of these benchmarks can be found in Table 1. There were three basic sets of benchmark groupings: integer intensive only, floating point

intensive only, and a combination of floating point and integer intensive benchmarks. We ran each of these groups with every scheduling algorithm. All tests were repeated on 5 different hardware configurations (2 threads 4 processes, 2 threads 8 processes, 2 threads 16 processes, 4 threads 8 processes, and 4 threads 16 processes), for a total of 75 tests. A representative set of all of this data is presented in the following sections.

5.1 Floating Point Only Benchmarks

Graph 1 represents the average IPC for all of the FP-intensive benchmarks running under a given set of test conditions. Note that whichever scheduler is run, if the only jobs in the system are floating point intensive, high IPCs result. In fact, there is very little difference between any of the policies when scheduling a floating point benchmark. We believe this is because of three factors. First, floating point programs still involve a fair amount of integer work. Integer intensive programs
