Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers


Jack Sampson
CSE Dept., UC San Diego

Rubén González†
Dept. of Comp. Arch., UPC Barcelona

Jean-Francois Collard, Norman P. Jouppi, Mike Schlansker
Hewlett-Packard Laboratories, Palo Alto, California

Brad Calder
CSE Dept., UC San Diego, and Microsoft

Started while at HP Labs. † While at HP Labs. Also funded by NSF grant No. CNS-0509546.

Abstract

We examine the ability of CMPs, due to their lower on-chip communication latencies, to exploit data parallelism at inner-loop granularities similar to that commonly targeted by vector machines. Parallelizing code in this manner leads to a high frequency of barriers, and we explore the impact of different barrier mechanisms upon the efficiency of this approach.

To further exploit the potential of CMPs for fine-grained data parallel tasks, we present barrier filters, a mechanism for fast barrier synchronization on chip multi-processors to enable vector computations to be efficiently distributed across the cores of a CMP. We ensure that all threads arriving at a barrier require an unavailable cache line to proceed, and, by placing additional hardware in the shared portions of the memory subsystem, we starve their requests until they all have arrived. Specifically, our approach uses invalidation requests to both make cache lines unavailable and identify when a thread has reached the barrier. We examine two types of barrier filters, one synchronizing through instruction cache lines, and the other through data cache lines.

1. Introduction

Large-scale chip multiprocessors will radically change the landscape of parallel processing by soon being ubiquitous, giving ISVs an incentive to exploit parallelism in their applications through multi-threading.

Historically, most multi-threaded code that exploits parallelism for single-application speedup has targeted multi-chip processors, where significant inter-chip communication latency has to be overcome, yielding a coarse-grained parallelization. Finer-grained data parallelization, such as over inner loops, has more traditionally been exploited by vector machines. Some scientific applications have dependencies between relatively small amounts of computation, and are best expressed using a SIMD or vector style of programming [26]. These dependencies may be implicit stalls within a processing element on a vector machine. In partitioning these calculations across multiple cores, these dependencies require explicit synchronization, which may take the form of barriers. With chip multi-processors, the significantly faster communication possible on a single chip makes possible the implementation of very fast barrier synchronization. This will open the door to a new level of parallelism for multi-threaded code, exploiting smaller code regions on many-core CMPs. The goal of our research is to explore how to use many-core CMPs to exploit fine-grained data parallelism, and we focus primarily on vector calculations. However, our approach is more general, since it can work on some codes that may not be easily vectorizable.
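As a concrete illustration of the style of parallelization we target, the following minimal sketch (not code from the paper or its benchmarks) splits a vector update into per-core chunks, with one barrier per outer-loop pass so that dependent work in the next pass sees all chunks completed. A standard POSIX barrier stands in here for the fast barriers discussed below; the array names and the update itself are placeholders.

    #include <pthread.h>

    #define N      256              /* vector length used in Table 1              */
    #define CORES  16               /* cores on the CMP                           */
    #define CHUNK  (N / CORES)      /* contiguous elements per core; a whole
                                       number of 8-double (64B) cache lines       */

    extern double x[N], y[N], z[N]; /* illustrative operands, not a Livermore kernel */
    extern pthread_barrier_t bar;   /* initialized elsewhere for CORES threads       */

    /* Body run by each of the CORES worker threads. Every outer iteration
       updates this thread's chunk, then waits at the barrier before the next
       (dependent) pass begins. */
    void worker(int tid, int iters)
    {
        for (int it = 0; it < iters; it++) {
            for (int i = tid * CHUNK; i < (tid + 1) * CHUNK; i++)
                x[i] = y[i] * z[i] + x[i];
            pthread_barrier_wait(&bar);   /* one barrier per inner-loop pass */
        }
    }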
To show the importance of fast barrier synchronization, consider Table 1, which shows the best speedups achieved by software-only barriers when distributing kernels across a 16-core CMP. For the Livermore loops, performance numbers shown are for vector lengths of 256; when distributed across 16 cores (in 8-element chunks for good performance with 64B cache lines), at a vector length of 256 all cores are kept busy, with each CMP core working on one (Livermore loop 2) or more chunks per iteration. As can be seen from the table, speedups using software-only approaches are not always present. This lack of speedup is primarily due to the relatively high latency of the software barriers in comparison to the fine-grained parallelism of the distributed computation. In contrast, the approach we will describe always provides a speedup for the parallelized code for all of the benchmarks.

    Kernel                     Best Software Barrier
    Livermore loop 2           0.42
    Livermore loop 3           1.52
    Livermore loop 6           2.08
    EEMBC Autocorrelation      3.86
    EEMBC Viterbi              0.76

Table 1. Speedups and slowdowns achieved on kernels distributed across a 16-core CMP when using the best software barriers, in comparison to sequential versions of the kernels executing on a single core. Numbers less than 1 are slowdowns, and point to the sequential version of the code as being a better alternative to parallelism when using software barriers.

In this paper, we study the role of fast barriers in enabling the distribution and synchronization of fine-grained data parallelism on CMPs. We examine the impact of different barrier implementations on this manner of parallelization, and introduce barrier filters, a new barrier mechanism.

Our barrier filter design focuses on providing a solution that does not require core modification. This includes not modifying the pipeline, the register file, or the L1 cache, which is tightly integrated with the pipeline. Our design also does not require new instructions, as there are existing ISAs (e.g., PowerPC, IA-64) that already provide the necessary functionality. Instead, we leverage the shared nature of CMP resources, and modify the shared portions of the memory subsystem. The barrier filter relies on one intuitive idea: we make sure threads at a barrier require an unavailable cache line to proceed, and we starve their requests until they all have arrived.

In this paper we make the following contributions:

- We show the potential of CMPs as a platform well-suited to exploiting fine-grained data parallelism for vector computations.
- We introduce a new method for barrier synchronization, which will allow parallelism at finer granularities than traditionally associated with software barriers.
- The hardware support required by the method is less intrusive than other hardware schemes: no dedicated networks are needed, and changes are limited to portions of the memory subsystem shared among cores, making the method well suited to the way CMPs are designed today.

2. Related Work

Hardware implementations of barriers have been around for a long time; most conceptually rely on a wired-AND line connecting cores or processors [8]. Other combining networks for synchronization include multi-stage and single-stage shuffle-exchange networks [12] and wired-NOR circuits [22, 3]. Beckmann and Polychronopoulos [3] assume a dedicated interconnect network for transmission of messages consisting of (Barrier ID, Barrier enable) and (Barrier ID, Set Processor Status) from each processor to a globally visible array of bit-vectors, each bit-vector corresponding to a barrier. For each barrier, zero-detect logic (wired-NOR) associated with each bit-vector then generates a signal upon the barrier condition being satisfied, passes the signal to a switch-box programmed by the last Barrier ID sent from each processor, and resets the global bit-vector associated with the satisfied barrier. The switch-box then distributes the signal, via an additional dedicated interconnect network, to the participating subset of processors.

Dedicated interconnect networks for barrier synchronization are not uncommon among massively parallel computers, appearing in the Ultracomputer's fetch-and-add combining switches [11], Sequent's synchronization bus [2], the CM-5's control network [16], the Cray T3D [21], and the global interrupt bus in Blue Gene/L [1, 7]. Our work is primarily focused on much smaller scale systems, namely CMPs, with a more commodity nature. On a smaller scale system, although not commodity in nature, Keckler et al. [14] discuss a barrier instruction on the MIT Multi-ALU Processor that utilizes six global wires per thread to perform rapid barrier synchronizations between multiple execution clusters on the same chip. However, in contrast to all the above approaches, we do not require additional interconnect networks for our new barrier mechanism; instead we focus on using the existing memory interconnect network.
We require neither additional processor state registers, nor that our core pipeline be subject to control from additional interrupt lines.

The Cray T3E [21], another massively parallel computer, abandoned the dedicated physical barrier network of its predecessor, the T3D, and instead allowed the barrier/eureka synchronization units (BSUs) to be connected via a virtual network over the interconnect already used for communication between the processing nodes. The position of a particular BSU within a barrier tree was configurable via registers in the network router associated with a given node's BSUs. Processing nodes would communicate with a local BSU, which would send a barrier packet to their parent in the barrier tree when all of the BSU's children had signaled arrival, and forward release notifications from the BSU's parent to the children. Barrier packets were then given preferential routing priority over other network traffic, so as to mimic having their own network. Blocking and restarting of threads using the T3E barrier hardware requires either polling the status of the BSU or arranging to receive an interrupt when the BSU status changes. Blocking and restarting of threads using our barrier filter mechanism is more lightweight, requiring neither polling of external registers nor the execution of an interrupt handler, instead directly stalling a processor through data starvation and restarting via cache fill.

Tullsen et al. [24] examine hardware support for locks, but not barriers, at both the SMT level (integrated with the load-store unit) and the CMP level (integrated with the shared L2 cache). In contrast to this work, we do not introduce new synchronization primitives to access our hardware structures and alter pipeline flow, but instead make use of properties of existing instructions.

Dedicated interconnection networks may also have special-purpose cache hardware to maintain a queue of processors waiting for the same lock [10]. The principal purpose of these hardware primitives is to reduce the impact of busy waiting. In contrast, our mechanism does not perform any busy waiting, nor does it rely on locks. By eliminating busy waiting, we can also reduce core power dissipation.

Saglam and Mooney [20] addressed hardware synchronization support on SoCs. That work offers fast atomic access to lock variables via a dedicated hardware unit. When a core fails to acquire a lock, its request is logged in the hardware unit. When the lock is released, an interrupt is generated to notify the core. ([20] does not report barrier timings.)

There is continued interest in improving the performance of software-based barriers, especially on large multiprocessor and multicomputer systems, where traditional hardware-based solutions may encounter implementation cost, scalability, or flexibility issues [18]. Nikolopoulos et al. [19] improve the performance of existing algorithms by using hardware primitives for uncacheable remote read-modify-write for the high-contention acquire phases of synchronization operations, obviating coherence traffic, while using cache-coherent operations for polling of the release condition. Cheng and Carter [5] exploit the increasing difference between local and remote references in large shared memory systems with a barrier algorithm tuned to minimize round-trip message latencies, and demonstrate the performance improvements possible through utilizing a coherence protocol that supports hardware write-update primitives. Our focus, unlike the above and most prior work concerning barriers, is on CMPs rather than large collections of processors. We therefore have rather different scalability concerns, but the low latency of communication on CMPs allows us to explore finer-grained parallelism than is traditionally exploited via thread-level parallelism.

3. Barrier Filter

3.1. Barrier Filter Overview

The goal of our approach is to find a good performance vs. design cost trade-off for fast global barrier synchronization. Our design is based on the observation that all memory access requests that go through a cache stall until the cache fill occurs. We can extend this mechanism to stall later memory requests and thus perform synchronization. We can provide barrier synchronization for a set of threads by making those threads access specific cache lines, which are not filled until the barrier condition is met. A thread using this barrier mechanism proceeds through the following three steps: (1) a signaling step, to denote the thread's arrival at the barrier; (2) a synchronizing step, to make sure the thread does not continue to execute instructions after arriving at the barrier, until the barrier access completes; and (3) upon resuming execution, a second signaling step, to denote that the thread has proceeded past the barrier.

We achieve global barrier synchronization by having each thread access a specific address, called an arrival address. Each thread is assigned a distinct arrival address by the operating system, which maps to a different cache line. A barrier filter, a hardware structure consisting of a state table and associated state machines, is placed in the controller for some shared level of memory, potentially even main memory. As increased distance from the core implies increased communication latency, we envision the most likely placement of the barrier filter to be in the controller for the first shared level of memory. The barrier filter keeps track of the arrival address assigned to each thread. The filter listens for invalidation requests for these arrival addresses, which signal that a thread has arrived at a barrier. After a thread has arrived at a barrier, it is blocked until all of the threads have arrived at the barrier.

To perform the barrier, each thread executes an instruction that invalidates the arrival address, and then attempts to access (load) the arrival address. The invalidate instruction removes any cache block containing this address from the cache hierarchy above the barrier filter, and indicates to the barrier filter that the thread has reached the barrier. The thread then does a fill request, which the barrier filter will filter and not service as long as that thread is blocked at the barrier. This prevents the thread from making forward progress until all of the threads have arrived at the barrier. Once all of the threads have arrived, the barrier filter will allow the fill requests to be serviced.

After the fill request to the arrival address is resolved, a thread needs to inform the filter that it has proceeded past the barrier and is now eligible to enter a barrier again. One way to do so is to have the thread access a second address, called an exit address. Thus, along with the arrival address, each thread is also assigned an exit address, which represents a distinct cache line and is also kept track of in the barrier filter.

3.2. Barrier Filter Architecture

We now describe the barrier filter architecture and the baseline multi-core architecture we assume. Figure 1 shows a representative CMP organization, where each core has a private L1 cache and accesses a shared L2 spread over multiple banks. This abstract organization was also selected in [4]. The number of banks does not necessarily equal the number of cores: the Power5 processor contains two cores, each providing support for two threads, and its L2 consists of 3 banks [13]; the Niagara processor has 8 cores, supporting 4 threads each, and a 4-way banked L2 cache [15]. In Niagara, the interconnect linking cores to L2 banks is a crossbar.

The barrier filter architecture is shown in Figure 1 to be incorporated into the L2 cache controllers at each bank. As memory requests are directed to memory channels depending on their physical target addresses, the operating system must make sure that all arrival and exit addresses it provides for a given barrier map to the same filter. For our design, the hardware can hold up to B barrier filters associated with each L2 bank.

Figure 1. Organization of a standard multi-core augmented with a replicated filter integrated with the L2 cache controller.

Each L2 controller can contain multiple barrier filters, each supporting a different barrier being used by a program. Figure 2 shows the state kept track of for a single barrier filter. The barrier filter has an arrival address tag, an exit address tag, and a table containing T entries, where T is the maximum number of threads supported for a barrier. The operating system allocates the cache line addresses for a barrier filter in such a way that the lower bits of the arrival and exit address can be used to distinguish which thread is accessing the barrier filter. This also means that only one arrival address tag needs to be stored to represent all of the arrival addresses, and similarly only one exit address tag needs to be used to identify all of the exit addresses. Therefore, the tags are used to identify the addresses, and the lower bits are used to index into the T entries. Each thread entry contains a valid bit, a pending fill bit indicating whether a fill request for the arrival address is currently blocked, and a two-bit state machine representing the state of the thread at the barrier. A thread is represented in a barrier filter by its state being in either the Waiting-on-arrival (Waiting), Blocked-until-release (Blocking), or Service-until-exit (Servicing) state. This state is associated with both the arrival and exit address for the thread, which were assigned to this thread by the operating system. Each barrier filter also contains a field called num-threads, which is the number of threads participating in the barrier, a counter representing the number of threads that have arrived at the barrier (arrived-counter), and a last valid entry pointer used when registering threads with the barrier filter.

Figure 2. A Barrier Filter Table.

Each thread will execute the following abstract code sequence to perform the barrier. Some of the steps can be merged or removed depending upon the particulars of the ISA used for a given implementation:

    memory fence
    invalidate arrival address
    discard prefetched data (flush stale copies)
    load or execute arrival address
    memory fence (needed only for some implementations)
    invalidate exit address

Note that all of the above instructions exist on some modern ISAs (e.g., PowerPC), so for those ISAs, no new instructions would have to be added. These architectures contain invalidate instructions as well as memory fence or synchronization instructions.

We assume throughout the rest of this paper that the memory fence ensures that the invalidation of the arrival address only occurs after all prior fetched memory instructions execute. This is achieved, for example, on the PowerPC by executing the sync instruction, which guarantees that the following invalidate instruction does not execute before any prior memory instructions. This ordering is enforced to allow the barrier filter to work for out-of-order architectures.
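On a PowerPC-style ISA, the sequence above might be realized roughly as follows. This is an illustrative sketch only, assuming GCC-style inline assembly: it uses dcbf (a user-mode flush that writes back and invalidates a data cache line) as the invalidate and sync as the fence, but the paper does not prescribe particular mnemonics, and the helper name is invented.

    /* Hypothetical helper: arrival and exit_line point to this thread's
       assigned cache lines. Compiles only for a PowerPC target. */
    static inline void filter_barrier(volatile char *arrival,
                                      volatile char *exit_line)
    {
        __asm__ volatile("sync" ::: "memory");                    /* order prior memory ops        */
        __asm__ volatile("dcbf 0,%0" :: "r"(arrival) : "memory"); /* invalidate arrival line       */
        __asm__ volatile("sync" ::: "memory");                    /* discard stale/prefetched data */
        char tmp = *arrival;                                      /* fill request: stalls here
                                                                     until the filter releases it  */
        (void)tmp;
        __asm__ volatile("sync" ::: "memory");                    /* fence (some implementations)  */
        __asm__ volatile("dcbf 0,%0" :: "r"(exit_line) : "memory"); /* signal exit                 */
    }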
After this code is executed, the thread must tell the barrier filter that it has made it past the barrier. One way to inform the barrier filter is to invalidate the exit address. The exit address is needed because the filter cannot otherwise know that a given thread has received the data from the fill request serviced by the filter. The barrier filter needs to know when each thread has been correctly notified, so that it can then start listening for future arrival invalidations from those threads.

When a barrier is created, it starts off with arrived-counter set to zero, num-threads set to the number of threads in the barrier, and the Arrival Address and Exit Address tags initialized by the operating system. All of the threads in the barrier start out in the Waiting state, as shown in Figure 3.
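For concreteness, the per-barrier state described above (and depicted in Figure 2) might be organized along the lines of the following sketch; the field names and the fixed entry count are illustrative choices, not taken from the paper.

    enum thread_state { WAITING, BLOCKING, SERVICING };   /* two-bit FSM state          */

    #define T 16                              /* max threads per barrier (assumed here) */

    struct filter_entry {
        unsigned          valid : 1;          /* entry registered to a thread           */
        unsigned          pending_fill : 1;   /* a fill to the arrival line is blocked  */
        enum thread_state state;              /* Waiting / Blocking / Servicing         */
    };

    struct barrier_filter {
        unsigned long arrival_tag;            /* one tag covers all T arrival lines     */
        unsigned long exit_tag;               /* one tag covers all T exit lines        */
        unsigned      num_threads;            /* threads participating in the barrier   */
        unsigned      arrived_counter;        /* arrivals seen so far                   */
        unsigned      last_valid_entry;       /* used while registering threads         */
        struct filter_entry entry[T];         /* indexed by low bits of the address     */
    };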

Figure 3. Finite State Automaton that implements the filter for one thread in a given barrier. Both arcs to the Servicing state are described by the lowermost annotation.

The barrier filter then examines invalidate messages that the L2 cache sees. When an address invalidate is seen, an associative lookup is performed in each barrier filter to see if the address matches the arrival or exit address for any of the filters. This lookup is done in parallel with the L2 cache access, and due to the small size of the barrier filter state, the latency of this lookup is much smaller than the L2 access latency. Thus, overall L2 access time would not be adversely affected. In our simulations, we therefore keep L2 latency fixed for both the base and filter variants.

As shown in Figure 3, if the barrier sees the invalidate of an arrival address and the thread's state is Waiting, then the thread's state will transition to the Blocking state, and arrived-counter is incremented. If the barrier sees the invalidate of an arrival address and the thread's state is Blocking, then the thread will stay in the Blocking state.

As shown in the code above, after a thread executes an invalidate for the arrival address, it will then access that address to do a fill request. If a fill request is seen by the barrier filter and the state of the thread is Blocking, then the state will stay blocked, and the pending fill bit for that thread will be set. The fill will not be serviced, because we will only service these fills once all of the threads have accessed the barrier. This pending fill is what blocks the thread's execution waiting for the barrier.

When a thread in state Waiting sees an arrival address invalidation, and arrived-counter is one less than num-threads, we know all of the threads have blocked. We therefore clear the arrived-counter and set all of the states for the threads to Servicing. In addition, we process all of the pending fill requests for the threads. If a fill request comes in for the arrival address and we are in state Servicing, then the fill request is serviced. If the barrier sees the invalidate of an exit address and the thread's state is Servicing, then the thread's state transitions to the Waiting state.
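The transitions just described can be summarized behaviorally as in the sketch below, which reuses the illustrative barrier_filter structure from the earlier sketch; service_fill and raise_error are hypothetical hooks standing in for the hardware actions, and this is a model of the FSM, not the paper's implementation.

    void service_fill(struct barrier_filter *f, unsigned t);  /* forward a blocked fill     */
    void raise_error(struct barrier_filter *f, unsigned t);   /* flag misuse (Sec. 3.3.4)   */

    /* Invalidate of thread t's arrival address observed at the filter. */
    void on_arrival_invalidate(struct barrier_filter *f, unsigned t)
    {
        switch (f->entry[t].state) {
        case WAITING:
            f->entry[t].state = BLOCKING;
            if (++f->arrived_counter == f->num_threads) {     /* last thread has arrived   */
                f->arrived_counter = 0;
                for (unsigned i = 0; i < f->num_threads; i++) {
                    f->entry[i].state = SERVICING;
                    if (f->entry[i].pending_fill) {           /* release the starved fills */
                        f->entry[i].pending_fill = 0;
                        service_fill(f, i);
                    }
                }
            }
            break;
        case BLOCKING:
            break;                                            /* stays in Blocking         */
        default:
            raise_error(f, t);
        }
    }

    /* Fill request for thread t's arrival address observed at the filter. */
    void on_arrival_fill(struct barrier_filter *f, unsigned t)
    {
        if (f->entry[t].state == BLOCKING)
            f->entry[t].pending_fill = 1;                     /* starve the request        */
        else if (f->entry[t].state == SERVICING)
            service_fill(f, t);                               /* barrier already open      */
        else
            raise_error(f, t);                                /* fill while Waiting        */
    }

    /* Invalidate of thread t's exit address observed at the filter. */
    void on_exit_invalidate(struct barrier_filter *f, unsigned t)
    {
        if (f->entry[t].state == SERVICING)
            f->entry[t].state = WAITING;                      /* ready for the next barrier */
        else
            raise_error(f, t);
    }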
3.2.1 MSHR Utilization

Miss Status Holding Registers (MSHRs) in cores keep track of the addresses of outstanding memory references and map them to pending instruction destination registers [9]. Outstanding fill requests to barrier filters thus occupy an MSHR slot in the core originating the request. The MSHR is released once the request is satisfied, or when the thread is context switched out. Hence, in a simultaneously multithreaded core, it is important to have at least as many MSHR entries as contexts participating in a barrier. However, providing at least one MSHR entry per SMT context is needed for good performance anyway, so the adoption of barrier filters should not change a core's MSHR storage requirements in practice.

3.3. Interaction with the Operating System

3.3.1 Registering a Barrier

We have assumed that the barrier routines are located within a barrier library provided by the operating system. This library also provides a fall-back software-based barrier. We also assume an operating system interface that registers a new barrier with the barrier filter by specifying the number of threads and returns a barrier handle to the user code in response. A request for a new barrier will receive a handle to a filter barrier if one is available, and if there are enough filter entries to support the number of threads requested for the barrier. If the request cannot be satisfied, then the handle returned will be for the fall-back software barrier implementation.

Each thread can then use the OS interface to register itself with the filter using the barrier's handle, receiving the virtual addresses corresponding to the arrival address and the exit address as the response. Once all the threads have registered, the barrier will be completely constructed and ready to use. Threads entering the barrier before all threads have registered will still stall, as the number of participating threads was determined at the time of barrier creation.
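A plausible shape for this interface is sketched below; the type and function names are invented for illustration, since the paper specifies only the behavior of the OS service, not its API.

    typedef int barrier_handle_t;            /* opaque handle returned by the OS       */

    struct barrier_lines {
        volatile char *arrival;              /* this thread's arrival cache line       */
        volatile char *exit_line;            /* this thread's exit cache line          */
    };

    /* Create a barrier for num_threads threads. Returns a handle backed by a
       hardware barrier filter when one with enough free entries is available,
       and otherwise a handle to the fall-back software barrier. */
    barrier_handle_t filter_barrier_create(unsigned num_threads);

    /* Called once by each participating thread: registers the caller with the
       filter and returns the virtual addresses of its arrival and exit lines. */
    struct barrier_lines filter_barrier_register(barrier_handle_t h);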

3.3.2 Initializing the Barrier

As discussed above, each thread is assigned a distinct arrival address by the operating system, which maps to a different cache line. Moreover, a given filter must be aware of all addresses assigned to threads participating in the barrier it supports. As pointed out in Section 3.2, the operating system must allocate the cache line addresses for a barrier filter in such a way that the lower bits of the arrival and exit address can be used to distinguish which thread is accessing the barrier filter. With this convention, only one arrival address tag needs to be stored to represent all of the arrival addresses, and similarly only one exit address tag needs to be used to identify all of the exit addresses.

If the target platform has multiple memory channels, as we have been assuming in this paper, memory requests are directed to memory channels depending on their physical target addresses. Care must thus be taken if the filter is distributed across channels, as we again assumed. In this situation, the operating system must make sure that all arrival and exit addresses it provides (for a given barrier) map to the same filter.

The operating system is responsible for initializing filter state, such as the tags for the arrival and exit address, and the counters arrived-counter and num-threads. To accomplish this, we assume the data contents of the filters to be memory mapped in kernel mode, obviating the need for additional instructions to manipulate them.
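To illustrate one address layout consistent with these constraints (the constants and helper names here are assumptions, not taken from the paper), the OS could place a barrier's arrival lines contiguously in a region known to map to a single filter, so that the low-order bits recover the thread's table entry:

    #define LINE_SIZE   64UL     /* 64-byte cache lines   */
    #define MAX_THREADS 16UL     /* T entries per filter  */

    /* arrival_base is chosen by the OS: aligned to MAX_THREADS * LINE_SIZE and
       falling in an address range served by one L2 bank (hence one filter). */
    static inline unsigned long arrival_addr(unsigned long arrival_base, unsigned tid)
    {
        return arrival_base + (unsigned long)tid * LINE_SIZE;
    }

    /* The filter recovers the table index from the low bits of the address. */
    static inline unsigned entry_index(unsigned long addr)
    {
        return (unsigned)((addr / LINE_SIZE) % MAX_THREADS);
    }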
3.3.3 Context Switch and Swapping Out a Barrier

The operating system can context switch out a stalled thread that is waiting on its fill request, which is blocked by the barrier filter. When this occurs, the fill request for the arrival address will not have committed, so when the thread is rescheduled it will re-issue the fill request and stall again if the barrier has not yet been serviced.

We do not assume pinning to cores, and blocked threads may therefore be rescheduled onto a different core. The distinct arrival and exit addresses for each thread are sufficient to uniquely identify the thread regardless of which core it is running on. If the barrier is still closed when the thread resumes execution, the filter continues to block this address when the rescheduled thread generates a new fill request. If the barrier opened while the thread was switched out, then when the requesting thread generates a new fill request, the barrier will service that request, as the corresponding exit address has not yet been invalidated. The servicing of the request to the core on which the thread was previously scheduled will not interfere with a subsequent barrier invocation, as any subsequent invocation performs an invalidate prior to again requesting the line on which it may block.

In addition, the operating system can swap out the contents of a barrier filter if it needs to use it for a different application (a different set of threads). When it does this, the operating system will not schedule any threads to execute that are associated with that barrier. Therefore, a barrier represents a co-schedulable group of threads that are only allowed to be scheduled when their barrier filter is loaded. This also means that the transitive set of threads assigned to a group of barriers needs enough concurrent hardware barrier support to allow these barriers to be loaded into the hardware at the same time. For example, if a Thread X needs to use Barriers A and B, and a Thread Y needs to use Barriers B and C, then the OS needs to assign addresses and ensure that there is hardware support so that Barriers A, B, and C can be scheduled and used at the same time on the hardware. If not, then the operating system would return an error when a thread tries to add itself to a barrier handle using the previously described interface. The error should then be handled by software.

3.3.4 Implementation Debugging and Incorrect Barrier Usage

Note that the reason why a thread's state does not go directly from Servicing to Blocking is to ensure the operating system's implementation of the barrier is obeying correct semantics, so we can flag these cases as errors. These few error cases represent invalid transitions in the FSM. If a thread's state in the barrier is Waiting and a fill request (load) accesses the arrival address, then an exception/fault should occur to tell the operating system that it has an incorrect implementation or use of the barrier filter. Similarly, an exception should occur if an invalidate for the arrival address for a thread is seen while the thread is in either the Blocking or Servicing states. Finally, when a thread is in the Waiting or Blocking states and its entry in the barrier filter sees an invalidate for its exit address, an error should also be reported.

The only time that the filter barrier implementations could cause a thread to stall indefinitely is if the barrier is used incorrectly. For example, incorrectly creating a barrier for more threads than are actually being used could cause all of the threads to stall indefinitely. But the same is true for any incorrect usage of any barrier implementation.

Note that, unlike a software barrier, which would experience deadlock by spinning inside an infinite loop, a cache miss fill request may not be able to be delayed indefinitely, due to limitations of the constituent cores. In this case the filter may generate a reply with an error code embedded in the response to the fill request. Upon receipt of an error code, the error-checking code in the barrier implementation could either retry the barrier or cause an exception. In terms of the barrier implementation itself, Figure 3 can be augmented with exceptions for incorrect state transitions and for error codes embedded in fill responses upon hardware timeouts.

3.4. Detailed Implementations

We examine two specific implementations of our barrier filter. The first uses instruction cache blocks to form the barrier, and the second uses data cache blocks. Both techniques use a similar instruction sequence to that described above. Note that invalidate instructions only invalidate a single cache line, and for a data cache, the line is written back if dirty on invalidate. These instructions are assumed to be user mode instructions
