Balancing Locality and Parallelism on Shared-cache Multi-core Systems

Transcription

2009 11th IEEE International Conference on High Performance Computing and Communications

Balancing Locality and Parallelism on Shared-cache Multi-core Systems

Michael Jason Cade
Texas State University
San Marcos, TX
mc1235@txstate.edu

Apan Qasem
Texas State University
San Marcos, TX
apan@txstate.edu

Abstract—The emergence of multi-core systems opens new opportunities for thread-level parallelism and dramatically increases the performance potential of applications running on these systems. However, the state of the art in performance-enhancing software is far from adequate with regard to the exploitation of hardware features on this complex new architecture. As a result, much of the performance capability of multi-core systems is yet to be realized. This research addresses one facet of this problem by exploring the relationship between data locality and parallelism in the context of multi-core architectures where one or more levels of cache are shared among the different cores. A model is presented for determining a profitable synchronization interval for concurrent threads that interact in a producer-consumer relationship.

Experimental results suggest that consideration of the synchronization window, or the amount of work individual threads can be allowed to do between synchronizations, allows for parallelism- and locality-aware performance optimizations. The optimum synchronization window is a function of the number of threads, data reuse patterns within the workload, and the size and configuration of the last level of cache that is shared among processing units. By considering these factors, the calculation of the optimum synchronization window incorporates parallelism and data locality issues for maximum performance.

Index Terms—shared-cache, parallelism, performance tuning, memory hierarchy optimization

I. INTRODUCTION

In the past, chip manufacturers have realized performance increases by packing more transistors on smaller chips. As modern designs approach the lower bounds on transistor size, as well as the upper bounds on cooling capacity, performance increases through increased design complexity and transistor size reduction are becoming more difficult and more expensive to achieve. In response, manufacturers have turned their efforts toward integrating multiple simplified processing cores onto a single chip [8]. By focusing development toward the duplication of relatively small and simple design units on a single chip, profitable advances in performance can again be realized. At the same time, reductions in power consumption and design overhead can be achieved [10]. By adequately capitalizing on opportunities for parallelism, a chip multiprocessor (CMP) with two cores can achieve the throughput of a single-core processor running at nearly twice the frequency. This is of enormous importance in the current power- and heat-limited environment because lower frequencies imply lower power consumption and less heat [1], [14].

Although CMP architectures promise large theoretical gains in performance potential, these improvements cannot be attained by hardware alone. In order to realize the full potential of CMP systems, much of the responsibility to find and exploit opportunities for parallelism is now placed on software and programmers [6]. In many cases, the state of the art in performance-enhancing tools lacks the sophistication required to make use of the full throughput and energy-savings potential of modern CMP systems.
The problem of finding and exploiting opportunities for parallelism in software is a difficult one and will require a great deal of effort if CMPs are to deliver on their performance and power capacities.

Further complicating the potential payoffs from the shift toward CMPs is the fact that, at some level, memory resources will be shared among different processing cores [18]. When cache resources are shared among processors, data locality and parallelism become related. Consider the data parallel execution model. Given a large data set and a task to be performed on the data, multiple processing units can be assigned to perform the task on subsets of the data in parallel. As parallelism is increased by the addition of more processing units, there is more contention for shared memory resources. The benefit of increased parallelism is then gained with an associated cost of reduced data locality. On the other hand, if too much attention is given to data locality and execution threads are delayed or shut down to avoid cache eviction, the improvement in locality comes at the cost of unexploited parallelism. Since the relationship between locality and parallelism can be contentious, care must be taken to find an appropriate balance between the two in order to optimize performance. Additionally, because the bandwidth to memory is shared by multiple processing units, underexploited data locality will force unnecessary memory accesses, which equate to needless consumption of power. Thus, in CMP systems power efficiency has become tied to efficient use of the memory hierarchy [19], [23].

When considering approaches to parallelization, there are essentially three models: data, task, and pipelined parallelism. In the past, pipelined parallelism has not received as much research attention as the data or task parallel models.

However, due to several factors that make pipeline parallelism more relevant for CMPs, this model of parallelism is likely to play a significant role in parallelizing applications on multi-core systems. Unlike data parallel models, pipeline parallelism can effectively exploit locality within a shared cache since data is shared among the pipeline stages. Furthermore, pipeline parallelism can be used to parallelize important classes of applications that exhibit producer-consumer behavior. Applications that fall within this domain include streaming multimedia such as MPEG decoders, iterative stencil computations, differential equation solvers, and computational fluid dynamics codes such as mgrid and swim from the SPEC benchmark suite. For many of these applications a data or task parallel model is either difficult to construct or inefficient in practice. Finally, with the shift toward CMPs, parallelization of workloads is essential to performance. The pipeline parallelism model can be used to parallelize many applications that may otherwise appear to be completely sequential. In particular, the pipeline parallelism model could be generalized to parallelize loops with carried dependencies.

This paper presents a model that captures the interaction between data locality and parallelism in the context of pipeline parallelism. To facilitate exploration of the relationship between parallelism and data locality, as well as the development of the parallelism-locality cache-reuse model, a synthetic benchmark has been constructed. The benchmark embodies the memory reuse patterns and exploitable parallelism characteristics of several applications that exhibit general producer-consumer behavior at various stages of computation.

Experimental results suggest that consideration of the synchronization window allows for parallelism- and locality-aware performance optimizations. The optimum synchronization window is a function of the number of threads, data reuse patterns within the workload, and the size and configuration of the last level of cache that is shared among processing units.

II. RELATED WORK

The related work is divided into three sections. First, work related to exploiting parallelism on CMPs is presented. Next, research that pertains to optimizing locality on CMPs is explored. The third section addresses research where optimizations for locality and the exploitation of parallelism are considered together.

A. Exploiting Parallelism on CMPs

The exploitation of parallelism on CMP architectures must become a key focus if the performance gains promised by this architecture are to be realized. In the past, legacy software has benefited from the hardware industry's ability to increase performance without changing the computational paradigm. Now, to realize the full potential of CMPs, software must adjust to a new parallel paradigm of computation [6]. Applications that take no measures to exploit the parallelism offered by CMPs may even see performance decrease on CMP platforms. On the other hand, applications that exploit opportunities for thread-level parallelism may perform 50-100 percent better on multi-core systems than on their superscalar predecessors [17].

B. Data Locality Optimizations for CMPs

The issue of machine balance is particularly difficult in current multi-core architectures because of the larger disproportion between the bus bandwidth and the compute power provided by multiple cores. Consider the Intel Clovertown processor. Each Clovertown chip contains four cores, each core capable of 10.64 GFlops.
In total, one Clovertown chip has a theoretical peak of 42.56 GFlops. However, the bus bandwidth tops out at 10.64 GB/s, which can provide at most 1.33 GWords/s (64-bit words). Since one core executing a workload heavily biased toward floating-point operations is more than enough to saturate the memory bus, adding additional cores to execute similarly biased workloads without making an attempt to exploit data locality would provide no significant benefit [2]. Buttari et al. describe such an effort to exploit data locality in linear algebra algorithms by representing operations as sequences of small tasks and dynamically scheduling them. Part of Buttari's effort relies on reorganizing matrices into a block data layout rather than the column- or row-major layouts found in FORTRAN and C [2].

Data locality issues incur an additional layer of complexity within the CMP domain since, in many cases, cores on a CMP share some of the cache facilities in addition to sharing the memory interface [20]. Future architectures are likely to continue to share some cache facilities among cores on a die. Research has shown performance increases of several orders of magnitude in data-intensive workloads on shared last-level cache arrangements versus CMPs whose cores have private last-level caches. Jaleel et al. assert that if multi-threaded applications tended to be data-independent, unique cache facilities for each core would be the most appropriate approach. However, if applications tend toward sharing some amount of data, sharing the last-level cache is a more appropriate approach. Jaleel reports a reduction in memory bandwidth demands by factors of 3 to 625 when the last-level cache is shared versus private last-level caches [11].

As more cores become available in CMP architectures, there are more compute resources available to running processes. At the same time, more cores mean more contention for shared memory resources and interfaces [16], [24]. Thus, contention for these shared on-chip memory resources and interfaces becomes one of the key issues in CMP performance optimization. Much work has been devoted to discovering and exploiting the opportunities for data locality within individual loop nests [12], [13], [21]. Li and Kandemir offer a more global loop-based locality approach that they apply to heterogeneous multi-core architectures. In their work, Li and Kandemir propose a system for considering all of the loop nests in an application simultaneously, thereby accounting for the interactions among different loop nests [15].

By exploiting the reuse of data that is already in cache wherever possible, two issues can be addressed.

First, execution time is decreased because memory latencies are not incurred a second time when data that has already been fetched can be reused rather than being cast out and fetched again later. Second, pressure on memory bandwidth is reduced because data does not have to be fetched multiple times. Ding and Kennedy present a two-step approach to reduce memory bandwidth requirements within a workload by exploiting data locality. The first step involves fusing computations on the same data to increase the amount of temporal locality. The second step involves reorganizing the data layout to group data used by the same computation to increase spatial locality [5].

Since the amount of activity on the memory bus can be correlated to power consumption [22], reducing pressure on the memory interface by exploiting data reuse has a positive effect on overall power consumption. Daylight et al. introduce data structure transformations that allow traversal through large data structures with fewer memory accesses and thereby decrease power consumption in embedded multimedia software [4].

C. Integrating Parallelism and Locality Optimizations for CMPs

Within the context of CMP architectures, the literature shows that software compiled with special attention given to parallelism and data locality issues runs with significant power and performance benefits [15].

One approach that incorporates parallelism with consideration for data locality is the stream programming model. In contrast to the von Neumann model, stream programming uses a data-flow model. In the data-flow model, data is loaded into local memory as a bulk load, operations are performed in parallel on the loaded data, and the results are stored as a bulk store. Streamware is a flexible software system proposed by Gummaraju et al. that can be used to map stream programs onto a variety of multi-core systems. Using the stream programming model, the authors are able to leverage the parallelism inherent in multi-core systems while still accounting for data locality issues. They also claim that their tools and methodologies are not restricted to traditionally streamed applications like media and image processing software but can be extended to general-purpose data-intensive applications [9].

Vadlamani and Jenks' Synchronized Pipelined Parallelism Model (SPPM) is another effort to incorporate parallelism with data locality in a way that is appropriate for CMPs [20]. SPPM applies to an important class of workloads where the problem can be seen as a sequence of interdependent stages. In other words, these workloads can be modeled as chains of computational stages, each stage having a producer-consumer relationship with its temporal neighbors. The heart of the SPPM algorithm involves processing units working simultaneously through the dataset in a temporally staggered arrangement. This allows processors that are ahead in the dataset to act as prefetch engines as well as producers of data for the processing units that follow. Research shows that a workload employing the SPPM model of parallelization incurs an overall cache miss rate and memory bus utilization similar to what it would incur if run strictly sequentially, while realizing the performance increases that come with parallelization [20].

D. Limitations of the Current Body of Literature

The current body of literature is somewhat sparse where data locality and parallelism are considered together.
The relationship between locality and parallelism is of considerable importance in light of modern CMP architectures, especially on CMP architectures where there is some sharing of cache. In shared-cache systems, parallelism and data locality become inextricably intertwined. More research should be directed toward modelling the behaviors surrounding this relationship so that workloads may be optimized for these new CMP architectures.

III. A MODEL FOR RELATING PARALLELISM AND DATA LOCALITY

The parallelism-locality model presented here has been developed through empirical study of a synthetic benchmark. The benchmark encapsulates the memory reuse patterns and parallelism characteristics of many workloads that exhibit temporal producer-consumer behavior at some point during execution.

It is important to note that all of the calculations and models here presuppose a contiguous allocation of memory for the dataset. Also, when discussing cache, this work refers to the last level of the cache hierarchy.

The major computational component of the synthetic benchmark involves a large dataset that can be thought of as a three-dimensional physical space. During execution, this dataset is updated iteratively. During each iteration, each member of the dataset is updated as a function of its adjacent neighbors and its current value, such that the data at time n+1 is a function of the data at time n.

One approach to this type of iterative, time-step update is to create a duplicate of the dataset. One dataset represents the current state, or time n, and the other represents the state of the data at time n+1, or the next state. Let the current-state dataset be called D_n and the next-state dataset be D_{n+1}.

A wholly sequential approach involves one thread of execution iterating through the entire D_n dataset, calculating the next-state values for each element, and writing those values to the D_{n+1} set. Once the entire next-state set has been populated, the D_n set and the D_{n+1} set are swapped, and the process repeats itself to calculate the next state. This continues until the desired number of iterations are completed and the data is in its final state, D_m.
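As a concrete reference point for the sequential baseline just described, the sketch below shows a double-buffered time-step update in C. It is a minimal illustration rather than the authors' benchmark code: the dimensions, the simplified 7-point update rule, and the names stencil_sequential and update_point are placeholders invented for this example.

```c
#define NX 512
#define NY 64
#define NZ 64

/* Placeholder next-state rule: in the benchmark each element is updated from
 * its own value and its adjacent neighbors at time n. A 7-point average is
 * used here only to keep the sketch short.                                    */
static double update_point(const double (*d)[NY][NZ], int i, int j, int k) {
    double sum = d[i][j][k];
    if (i > 0)      sum += d[i - 1][j][k];
    if (i < NX - 1) sum += d[i + 1][j][k];
    if (j > 0)      sum += d[i][j - 1][k];
    if (j < NY - 1) sum += d[i][j + 1][k];
    if (k > 0)      sum += d[i][j][k - 1];
    if (k < NZ - 1) sum += d[i][j][k + 1];
    return sum / 7.0;
}

/* Sequential, double-buffered update: read D_n, write D_{n+1}, swap, repeat
 * until the final state D_m is reached. After an odd number of steps the
 * newest data is in the buffer that was passed as dn1.                        */
void stencil_sequential(double (*dn)[NY][NZ], double (*dn1)[NY][NZ], int steps) {
    for (int t = 0; t < steps; t++) {
        for (int i = 0; i < NX; i++)
            for (int j = 0; j < NY; j++)
                for (int k = 0; k < NZ; k++)
                    dn1[i][j][k] =
                        update_point((const double (*)[NY][NZ])dn, i, j, k);
        double (*tmp)[NY][NZ] = dn; dn = dn1; dn1 = tmp; /* swap D_n, D_{n+1} */
    }
}
```

The parallel variants discussed next keep this even/odd buffer structure but allow multiple threads to work on different time steps at once.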

At the other extreme, up to m threads of execution may be deployed to update the data in parallel, so long as care is taken to ensure that threads operating on future updates do not collide with threads working on past updates. In this parallel approach, the two datasets can be thought of as the even-iteration data and the odd-iteration data. For example, thread_0 would be deployed, reading from the D_{2n} set and writing to the D_{2n+1} set. After thread_0 had read enough of the D_{2n} set and written enough of the D_{2n+1} set, thread_1 could begin reading from the D_{2n+1} set and writing to the D_{2n} set. At some point, thread_1 would be far enough along to allow thread_2 to begin reading from the D_{2n} set and writing to the D_{2n+1} set. This would continue until either m threads had been deployed or thread_0 reached the end of the D_{2n} set and could begin again at the top of D_{2n}, making the update for the next necessary state (Figure 1).

Fig. 1. Multiple threads may operate simultaneously on the data so long as they do not collide. Here, thread_0 reads from D_{2n} and writes to D_{2n+1}. A second thread, thread_1, may be deployed that reads from D_{2n+1} and writes to D_{2n} so long as mechanisms are in place to prevent thread_1 from overwriting data that thread_0 has not yet consumed.

In order to preserve data dependencies, the threads must be prevented from colliding. If a thread that is making an update for time t+1 gets too close, in terms of data, to the thread that is making the update for time t, the data necessary for the time t update will be overwritten by the update for time t+1 and the final result will be corrupted. Thus, some synchronization device must be implemented to keep the threads sufficiently far apart.

In the synthetic benchmark presented here, a second synchronization limit is put on threads operating in parallel. In addition to preserving data dependence by preventing threads from getting too close together, an attempt is made to exploit data locality by preventing threads from getting too far apart. This introduces the concept of an execution window. Synchronization for the synthetic benchmark's threads is performed such that the threads of execution operate within a minimum and maximum window size (W_min and W_max, respectively), measured by data distance on the innermost index of the dataset (Figure 2).

The execution window allows for the concept of a synchronization granularity. The synchronization granularity is the number of iterations that either thread may safely execute and still maintain the execution window constraints. If thread_0 and thread_1 are the ideal distance from one another, then either thread may execute no more than (W_max - W_min - 1)/2 iterations on the innermost index of the dataset without regard for the other thread's progress. The synchronization granularity, W_i, is expressed in the following equation:

W_i = (W_max - W_min - 1) / 2    (1)

After each W_i iterations on the innermost dimension of the dataset, there is a barrier that forces the threads to return to the ideal separation distance. As W_i grows, so does the interval between thread synchronizations. Consequently, as this interval grows, synchronization overhead decreases.

Fig. 2. Here, W_i is the synchronization granularity, W_min is the minimum window size, and W_max is the maximum window size (W_max = 2 W_i + W_min + 1). The figure marks the ideal positions of thread_0 and thread_1 within the window.
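A minimal sketch of this window mechanism, assuming two threads and a POSIX barrier, is shown below. It is not the paper's benchmark code: the staggered start of the second thread is assumed to have been established at launch, and names such as windowed_update and window_barrier are invented for illustration.

```c
#include <pthread.h>

#define NUM_THREADS 2

/* Barrier that returns the threads to the ideal separation after
 * every W_i iterations on the innermost dimension.                */
static pthread_barrier_t window_barrier;

void window_init(void) {
    pthread_barrier_init(&window_barrier, NULL, NUM_THREADS);
}

/* Each thread advances w_i innermost-index iterations at a time and then
 * synchronizes. Given an ideal starting separation, the leading thread
 * never drifts past W_max (preserving locality) and the trailing thread
 * never closes below W_min (preserving the data dependence).             */
void windowed_update(int tid, long n_inner, long w_i,
                     void (*do_chunk)(int tid, long start, long len)) {
    for (long start = 0; start < n_inner; start += w_i) {
        long len = (start + w_i <= n_inner) ? w_i : (n_inner - start);
        do_chunk(tid, start, len);        /* update len innermost-index slices */
        pthread_barrier_wait(&window_barrier);
    }
}
```

With W_min fixed and W_max as the independent variable, w_i would follow from equation (1); a larger w_i means fewer barrier crossings and therefore less synchronization overhead.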
The optimal synchronization interval would maximize the exploitation of data locality while incurring the least amount of synchronization overhead. Therefore, W_i, and with it W_max, should be as large as possible. However, if the synchronization interval between thread_0 and the last thread (thread_n) is too large, later threads will compete with earlier threads for cache rather than being able to capitalize on data cached from previous accesses. For this reason, the distance, in terms of data, between thread_0 and thread_n should be within the size of the cache. Ultimately, then, the optimal synchronization interval is the largest interval such that the data between thread_0 and thread_n fits within cache. In other words, the ideal situation has thread_n accessing the oldest data in cache while thread_0 consumes the newest data in cache. The fraction of the cache between thread_0 and thread_n can be expressed by the following equation:

U = α / C    (2)

where U is the fraction of the cache consumed by data that thread_0 has accessed but that thread_n has not yet accessed, α is the cost, in terms of cache, of updating the elements that thread_n has not yet updated but thread_0 has finished updating for this pass through the dataset, and C is the total size of the cache.

The cache cost factor can be expressed as:

α = (T - 1) · W_i · β · γ    (3)

where T is the total number of threads deployed to update the dataset, β is the number of elements in the dataset for each unit of W_i, and γ is a measure of the cost, in terms of cache, of evaluating the next state for one element in the dataset. Here, T must be greater than or equal to two, since if there are fewer than two threads there can be no cache consumption cost between the first thread and the last thread. Also worth noting is that because data moves in and out of cache at the granularity of cache lines rather than bytes, γ must be considered in terms of cache lines and not bytes.

The γ term may be expressed in the following way:

γ = l / (S / d)    (4)

where l is the number of cache lines accessed when calculating the next state for one element of the dataset, S is the size of one cache line, and d is the size of one element of the dataset. Dividing l by S/d accounts for the locality inherent in contiguous accesses to the same cache line. Of course, this expression of the cache-cost term presupposes a linear and ordered access pattern across multiple element updates. If the access pattern were to randomly select elements from the dataset, the S/d term would be incorrect and γ would have to be accounted for in a different way.

Finally, U may be explicitly defined as:

U = ((T - 1) · W_i · β · l · d) / (S · L)    (5)

where L is the total number of lines in cache. L replaces C in equation 2 since here cache costs are defined in terms of cache lines and not bytes. However, since L can be rewritten as C/S, where C is the total last-level cache size in bytes, the denominator of equation 5 reverts to C, as in (2).

For the synthetic benchmark described in this work, β is the product of the second and third dimensions of the dataset. Also, since the ultimate goal is to maximize U without exceeding the cache capacity, the target value for U is one. Therefore, based on equation 5, the following represents the optimal synchronization granularity for the synthetic benchmark:

W_i = C / ((T - 1) · j · k · l · d)    (6)

For example, consider a three-dimensional iterative stencil algorithm applied to a dataset that has a second and third dimension of 64 elements, each element being eight bytes. With 64-byte cache lines, and an update function that is dependent on all of an element's neighbors, the l term is nine (one cache line is accessed for each of the nine rows that make up the cube surrounding the element in question). Using the cache-use model in equation 6 with a four-megabyte last-level cache shared between two threads, the theoretical ideal synchronization interval is calculated to be 14.2 units on the innermost dimension of the dataset:

W_i = 2^22 / ((2 - 1) · 2^6 · 2^6 · 9 · 2^3) ≈ 14.2    (7)

Fig. 3. Data miss rate for the synthetic benchmark with four-kilobyte pages (miss rates of thread_0 and thread_1 plotted against W_max, 15 to 65).
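The cache-use model reduces to a one-line calculation, sketched below in C. The function and parameter names are invented for this illustration; the values in main correspond to the worked example above, and the second print applies the window relation from Figure 2 (W_max = 2 W_i + W_min + 1, with W_min = 2) to arrive at the W_max of roughly 31 referenced in Section IV.

```c
#include <stdio.h>

/* Equation (6): W_i = C / ((T - 1) * j * k * l * d)
 *   cache_bytes    (C): last-level cache size in bytes
 *   threads        (T): threads sharing the cache (must be >= 2)
 *   j, k              : second and third dimensions of the dataset (beta = j*k)
 *   lines_per_elem (l): cache lines touched when updating one element
 *   elem_bytes     (d): size of one element in bytes                           */
double optimal_wi(double cache_bytes, int threads, int j, int k,
                  int lines_per_elem, int elem_bytes) {
    return cache_bytes /
           ((threads - 1) * (double)j * k * lines_per_elem * elem_bytes);
}

int main(void) {
    /* Worked example: 4 MB shared L2, two threads, 64 x 64 inner planes,
     * l = 9 lines per element update, 8-byte doubles.                      */
    double wi = optimal_wi(4.0 * 1024 * 1024, 2, 64, 64, 9, 8);
    printf("optimal W_i  ~= %.1f\n", wi);                /* ~14.2           */
    printf("optimal Wmax ~= %.1f\n", 2.0 * wi + 2 + 1);  /* W_min = 2, ~31  */
    return 0;
}
```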
IV. EXPERIMENTAL RESULTS

A. Testing Environment

Experimental results were collected by running the synthetic benchmark on a 2.33 GHz Intel Core 2 Duo with 4 MB of L2 cache shared between the two cores. The benchmark tests were run under Linux 2.6.24 with the perfctr module installed. Cache and TLB data were collected with PAPI. PAPI is a platform-independent specification for an interface to hardware performance counters. These counters exist on modern microprocessors as a small set of registers that can be used to count the occurrence of specific events related to a processor's function [7].

For the synthetic benchmark, there are two important timing measurements: time_wall and time_wait. time_wall is simply the amount of wall-clock time spent in the function that performs the iterative stencil algorithm. This is measured with two calls to gettimeofday, one at the function's entry point and another at the function's exit point. time_wait is a measure of the length of time threads spend waiting at synchronization barriers. It is also measured with two calls to gettimeofday, one immediately before entering a barrier and another immediately after exiting a barrier. Synchronization counts were collected with counters in the benchmark code itself.
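The two timings can be captured with plain gettimeofday brackets, as in the sketch below. This is illustrative only; the barrier and the run_stencil callback stand in for the benchmark's actual routines.

```c
#include <sys/time.h>
#include <pthread.h>

static pthread_barrier_t window_barrier;   /* initialized elsewhere, as in the earlier sketch */

/* Per-thread accumulators for time_wait and the synchronization count. */
static _Thread_local double time_wait;
static _Thread_local long   sync_count;

static double seconds_since(const struct timeval *t0) {
    struct timeval t1;
    gettimeofday(&t1, NULL);
    return (double)(t1.tv_sec - t0->tv_sec) + (t1.tv_usec - t0->tv_usec) * 1e-6;
}

/* time_wait: gettimeofday immediately before entering and after exiting the barrier. */
static void timed_barrier_wait(void) {
    struct timeval t0;
    gettimeofday(&t0, NULL);
    pthread_barrier_wait(&window_barrier);
    time_wait += seconds_since(&t0);
    sync_count++;
}

/* time_wall: gettimeofday at entry to and exit from the stencil routine. */
double timed_stencil(void (*run_stencil)(void)) {
    struct timeval t0;
    gettimeofday(&t0, NULL);
    run_stencil();
    return seconds_since(&t0);
}
```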

TABLE I
PERFORMANCE OF SEQUENTIAL VERSUS PARALLEL EXECUTION ON FOUR-KILOBYTE PAGES

                              Parallel (W_max = 31)
                 Sequential    thread_0       thread_1       total
Total Cycles     1.20 × 10^11  5.81 × 10^10   5.64 × 10^10   1.14 × 10^11
Wall Clock Time  51.61 s       25.58 s        25.58 s        25.75 s
TLB Misses       3.56 × 10^6   1.84 × 10^6    1.82 × 10^6    3.66 × 10^6
Data Miss Rate   0.003         0.0031         0.0007         0.0019

TABLE II
PERFORMANCE OF 4KB VERSUS 16MB PAGES (W_max = 31)

                          Total Cycles   Wall Clock Time   TLB Misses    Data Miss Rate
4KB pages    thread_0     5.81 × 10^10   25.58 s           1.84 × 10^6   0.0031
             thread_1     5.64 × 10^10   25.58 s           1.82 × 10^6   0.0007
             sum          1.15 × 10^11   25.75 s           3.66 × 10^6   0.0019
16MB pages   thread_0     5.78 × 10^10   24.96 s           5.31 × 10^4   0.0031
             thread_1     5.62 × 10^10   24.96 s           7.75 × 10^4   0.0002
             sum          1.14 × 10^11   25.13 s           1.31 × 10^5   0.0017

A dataset of substantial size is necessary to smooth the data collected and to emphasize differences in measurements under different testing conditions. These goals can also be supported by iterating over the dataset a large number of times. In contrast, the dataset and iteration count should be small enough to allow data to be collected in a reasonable amount of running time. For each test case presented here, the dataset is a 512 × 64 × 64 array of doubles. The array is updated using a three-dimensional iterative stencil algorithm where the next state of each element is dependent on the current state of the element and the current state of all of its neighbors. The entire dataset is updated iteratively in this way 400 times for each test. These values represent a good compromise between a sample large enough to reduce the noise in the data and a sample size small enough to allow data to be collected in a reasonable amount of time.

Multi-threaded test cases involve two parallel threads of execution. Each thread is bound to a unique core using calls to pthread_setaffinity_np so that each thread is explicitly scheduled on one and only one processing unit.
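Binding a thread to its own core, as described above, typically takes only a few lines with pthread_setaffinity_np on GNU/Linux; the sketch below is a generic illustration, not the benchmark's code.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core so it is scheduled on
 * one and only one processing unit. Returns 0 on success.        */
int bind_self_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Calling bind_self_to_core(0) and bind_self_to_core(1) from the two worker threads would pin them to separate cores of the Core 2 Duo.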
The minimum window size is fixed at the smallest value that preserves data dependencies. For the synthetic benchmark described here, the minimum window size is fixed at two. Since the minimum window size is fixed, the independent variable is the maximum window size. Synchronization granularity, or W_i, is directly related to the maximum window size; increasing W_max increases the synchronization granularity. Equation 1 shows the relationship between W_max and W_i. To evaluate the model presented in equation 6, the benchmark is repeated over a range of maximum window sizes. The results presented are the averaged results of five test runs under the same testing conditions.

B. Four-Kilobyte Pages

Initially, a series of tests was conducted with standard four-kilobyte pages. The parallelized code showed a performance improvement over strictly sequential code (see Table I). Although performance increased over sequential execution, the parallelized code did not exhibit the expected optimum when the synchronization window sizes were manipulated.

Intuitively, there should be an optimal synchronization window size at which the maximum amount of data locality is exploited while incurring the minimum amount of synchronization overhead. Based on equations 6 and 1, an optimum should appear at a W_max of 31. However, in tests using four-kilobyte pages, the increase in miss rates is linear relative to the increase in W_max near the theoretical optimal synchronization window size (see Figure 3).

The wall-clock time results with four-kilobyte pages were similar. Experiments did not reveal a minimum that would clearly point to an optimal synchronization window size.

The inconclusive behavior of the benchmark running with four-kilobyte pages is attributed to the small page size. At 512 × 64 × 64 doubles, the dataset occupies 2^12 (4,096) four-kilobyte pages. Since there are two copies of the dataset, one for the current state and one for the next state, the entire dataset covers 2^13 (8,192) four-kilobyte pages. Because the data covers more than 8,000 pages, and these pages are scattered around the real address space, the access pattern of real addresses is likely to be somewhat disorderly when compared to the access pattern of a contiguously mapped dataset. If the access pattern of real addresses is disorderly, data blocks are less likely to be mapped into cache sets in an orderly way. Therefore, the lack of a strictly
