Phoenix: A Runtime Environment for High Performance Computing on Chip Multiprocessors

Transcription

Proc. 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP'09), 2009, pp. 119-126.

Phoenix: A Runtime Environment for High Performance Computing on Chip Multiprocessors

Avneesh Pant, Hassan Jafri, Volodymyr Kindratenko
National Center for Supercomputing Applications (NCSA)
University of Illinois at Urbana-Champaign (UIUC)
Urbana, IL, USA
e-mail: {apant hjafri kindr}@ncsa.uiuc.edu

Abstract—Execution of applications on upcoming high performance computing (HPC) systems introduces a variety of new challenges and amplifies many existing ones. These systems will be composed of a large number of "fat" nodes, where each node consists of multiple processors on a chip with symmetric multithreading capabilities, interconnected via high-performance networks. Traditional system software for parallel computing considers these chip multiprocessors (CMPs) as arrays of symmetric multiprocessing cores, when in fact there are fundamental differences among them. Opportunities for optimization on CMPs are lost using this approach. We show that support for fine-grained parallelism coupled with an integrated approach for scheduling of compute and communication tasks is required for efficient execution on this architecture. We propose Phoenix, a runtime system designed specifically for execution on CMP architectures to address the challenges of performance and programmability for upcoming HPC systems. An implementation of the message passing interface (MPI) atop Phoenix is presented. Micro-benchmarks and a production MPI application are used to highlight the benefits of our implementation vis-à-vis traditional MPI implementations on CMP architectures.

Keywords: runtime; chip multiprocessors; MPI

I. INTRODUCTION

Chip multiprocessors (CMPs) consisting of multiple processing cores on a single chip have become commodity. Memory bandwidth and power consumption limitations, coupled with the continued march of Moore's law, are resulting in an ever increasing number of cores per chip [13]. As supercomputing enters the petascale era, we anticipate that upcoming systems will consist of a large number of fat nodes, consisting of multi-core processors with symmetric multithreading (SMT) capabilities, interconnected via high-performance networking fabrics [11,12].

Current state-of-the-art system software treats CMPs as arrays of symmetric multiprocessing (SMP) cores; however, there are significant fundamental operational differences between CMP and SMP architectures that need to be considered to achieve their true potential [10]. For instance, CMP processing elements tend to have smaller dedicated caches per core compared to SMP. The aggregate cache size of a CMP system can be up to an order of magnitude smaller than a similarly sized SMP system. Applications on CMP will need to be threaded at a finer granularity to reduce their working set size for efficient execution. The inter-core communication bandwidth and latency on a CMP can be an order of magnitude more efficient than communicating across cores on an SMP. It is expected that memory latency and bandwidth bottlenecks will be further exacerbated in CMPs, leading to poor application scaling. Furthermore, we expect there will be a decline in available network bandwidth per core, due to core counts increasing at a faster rate than network bandwidth.

Execution of parallel applications on these architectures brings to light a number of new challenges while magnifying many old problems.
For applications, there are challenges to be tackled in terms of programmability as well as an optimal runtime environment. The message passing paradigm, as used by the message passing interface (MPI) [14], and the partitioned global address space (PGAS) paradigm [15,18], as used by UPC, CAF, etc., give the programmer flexibility in architecting the parallel application: the programmer has extensive control over the details of parallelization and the capacity to extract maximum performance from the underlying hardware resources. This flexibility, however, can be cumbersome to manage on CMPs due to their inherent complexity and scale. A sophisticated parallel runtime system can help deal with this complexity through efficient management of hardware resources while providing interfaces and semantics for better programmability.

Parallel programming paradigms that rely on current runtime systems are also not adequately tailored for execution on CMPs. MPI implementations, for example, treat a CMP as a collection of SMP cores, thus losing the opportunity to exploit features that enable fine-grained parallelism. MPI tasks, implemented as heavyweight processes, incur high context switch overhead during scheduling. Additionally, communication between tasks on a node incurs unnecessary overhead due to additional memory copies across process address space boundaries. The operating system scheduler is responsible for scheduling MPI tasks on physical processors. Configurable, application-specific scheduling policies are challenging to implement in this environment, as they would require kernel-level modifications.

Software stacks for CMPs will be required to be cognizant of these issues.

An integrated approach for scheduling of constrained resources to minimize contention and oversubscription is imperative. We propose Phoenix, a modular and extensible runtime for parallel application developers, compiler writers and system researchers to use as a vehicle for researching parallel scientific computing on CMP platforms and as a fully optimized runtime framework on such systems. The Phoenix runtime consists of a collection of interfaces for thread management, scheduling, locking, memory allocation, etc., explicitly designed to support the fine-grained parallelism present on CMP architectures. Layering existing popular parallel programming paradigms, such as MPI, atop Phoenix will enable efficient execution of legacy applications on CMP architectures.

Our prototype implementation of MPI represents MPI tasks as lightweight Phoenix threads. A significantly larger number of MPI tasks than processors on a node can be spawned. This enables support for fine-grained parallelism by ensuring that the working set of a task can fit within the limited cache on CMPs. Context switch overhead is further reduced over conventional MPI implementations due to our use of lightweight thread abstractions to represent tasks. Since MPI tasks exist within the same address space, this approach leads to efficient communication between tasks within a node. The Phoenix runtime system is used to provide integrated scheduling of both MPI tasks and network progress threads. We show with micro-benchmarks and a production parallel MPI application that significant performance gains can be obtained over conventional MPI implementations with careful design of a system tuned for CMP architectures.

The rest of this paper is organized as follows. Section II gives a description of Phoenix, our intelligent-adaptive runtime system, and its constituent components. Section III covers the MPI paradigm and its implementation on the Phoenix virtualization runtime. Section IV provides some preliminary results using micro-benchmarks and a production MPI code called MILC [1,2]. Section V covers related work and Section VI concludes this paper.

II. PHOENIX RUNTIME ARCHITECTURE

The Phoenix core runtime (PCR) system provides a rich feature set for efficient layering of parallel programming models. The core functionality for scheduling, thread management, memory allocation, debugging, etc., is contained within the runtime system. The PCR provides a compact and expressive interface that is suitable as a target for a compiler backend and for experimenting with new languages for parallel computing designed to scale from the desktop to large-scale systems. Each component of Phoenix provides a well-defined interface, enabling system designers to experiment with different policies and implementations of a given component by plugging them in at a defined application programming interface (API) layer. This flexibility is vital to explore a large design space for optimizations on CMP architectures [16].

Phoenix is designed to enable applications to scale from a single core to potentially all cores available on a node. In order to support layering of parallel programming models that span nodes, Phoenix provides a high-performance modular network layer with support for active message and remote direct memory access (RDMA) communication semantics. Fig. 1 provides the schematic view of the Phoenix ecosystem. This section provides an in-depth view of the Phoenix ecosystem.

Figure 1. Phoenix ecosystem.

A. Execution Contexts

Execution contexts (ECs) are the basic unit of execution in Phoenix. ECs are analogous to threads, and the Phoenix EC component API is representative of such. The API consists of routines related to EC management (create, exit, id) and synchronization (suspend and wait). The EC interface is implemented using POSIX threads. In order to support the fine-grained parallelism present on CMP architectures, a large number of ECs need to be instantiated. Lightweight thread abstractions that can lead to a substantial reduction in memory footprint and context switch overhead are being investigated.

An important feature of Phoenix ECs is attribute tagging. Applications tend to execute in distinct phases. In each phase the resource requirements of an application can vary significantly. HPC applications, for example, during a given iteration, may compute for a sizeable amount of time followed by exchanging data with neighboring processors over the network. This pattern is repeated over the course of a run. Phoenix, therefore, features the notion of EC attributes for identifying the dominant characteristic of EC execution. The currently supported attributes are: compute, network and I/O. An EC can change its attribute dynamically during execution of each phase. Attribute tagging will allow the exploration of various resource-aware scheduling strategies that minimize contention for resources.
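The paper describes the EC interface only in prose, so the C header sketch below is merely an illustration of what an API with management, synchronization and attribute-tagging routines could look like; every identifier (phx_ec_t, phx_ec_create, and so on) is an assumption, not the actual Phoenix API.

/* Illustrative sketch only: identifiers are assumed, not taken from Phoenix. */

typedef enum {
    PHX_ATTR_COMPUTE,   /* EC is in a compute-dominated phase */
    PHX_ATTR_NETWORK,   /* EC is driving network communication */
    PHX_ATTR_IO         /* EC is performing file or other I/O  */
} phx_ec_attr_t;

typedef struct phx_ec phx_ec_t;          /* opaque execution context handle */
typedef void (*phx_ec_fn_t)(void *arg);  /* EC entry point */

/* EC management: create, exit, id (as enumerated in the text). */
int      phx_ec_create(phx_ec_t **ec, phx_ec_fn_t fn, void *arg,
                       phx_ec_attr_t initial_attr);
void     phx_ec_exit(void);
unsigned phx_ec_id(void);

/* Synchronization: suspend the calling EC, or wait for another EC to finish. */
void     phx_ec_suspend(void);
int      phx_ec_wait(phx_ec_t *ec);

/* Attribute tagging: an EC retags itself as it moves between phases,
 * e.g. phx_ec_set_attr(PHX_ATTR_NETWORK) before a communication phase. */
void     phx_ec_set_attr(phx_ec_attr_t attr);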

B. Processor Virtualization

Each schedulable processing element is represented by a virtual processor (VP) within the Phoenix runtime system. The VP API exposes primitives related to binding and scheduling of ECs onto the physical processing element, as well as functions to query the underlying physical characteristics of the hardware. In order to support identification of performance bottlenecks, the processor virtualization component keeps track of vital profiling information during the course of execution. Each VP maintains processor utilization, total context switches, and idle and compute times for itself. This information can be queried during a run by end-user tools or even by Phoenix components such as the scheduler to make informed decisions.

C. Scheduler

The Phoenix scheduler dispatches ECs for execution on VPs. Phoenix implements cooperative scheduling semantics. Higher-level programming models utilize this to present the illusion of independent progress for all ECs. Pre-emptive scheduling of ECs on CMP architectures is not desirable, as parallel applications, which exhibit distinct communication and compute phases, would benefit if entire cores were dedicated to their ECs during the compute phase. The scheduler provides a pluggable framework for Phoenix VPs, allowing implementation of processor-agnostic scheduling strategies. Scheduling across processors is accomplished using scheduling domains. A scheduling domain represents a group of VPs that exhibit common physical and performance characteristics. VPs dispatch ECs scheduled on a domain onto the underlying processor.

In order to have effective control of scheduling policies, it is necessary to bypass the OS scheduler. The current implementation on Linux utilizes the native POSIX thread library (NPTL). NPTL is a 1:1 thread library in that threads created by the user are in one-to-one correspondence with schedulable entities in the kernel. The Linux OS scheduler is negated by ensuring that at most one EC, i.e., pthread, is in a runnable state per physical processor. This is accomplished by utilizing condition variables and semaphores. Each EC waits on a semaphore, which results in the EC being suspended within the kernel. Phoenix schedules an EC by signaling the semaphore corresponding to the EC, resulting in the pthread associated with the EC transitioning to a run state within the kernel. This approach, coupled with the ability to bind ECs to processors, provides full control over placement and scheduling of ECs within Phoenix.
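The paper gives this mechanism only in prose. The following C sketch shows, under assumed names, the general technique of keeping at most one runnable pthread per processor with POSIX semaphores; it is not the Phoenix implementation, and the round-robin dispatch loop stands in for whatever policy the real scheduler uses.

/* Minimal sketch of the "one runnable pthread per processor" idea.
 * Compile with: gcc sched_sketch.c -pthread */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NUM_ECS 4

typedef struct {
    sem_t run;    /* EC blocks here until the runtime signals it  */
    sem_t done;   /* EC signals here when it cooperatively yields */
    int   id;
} ec_t;

static ec_t ecs[NUM_ECS];

/* Body of an EC: wait until scheduled, do a slice of work, yield. */
static void *ec_main(void *arg)
{
    ec_t *self = arg;
    for (int slice = 0; slice < 3; slice++) {
        sem_wait(&self->run);     /* suspended in the kernel until scheduled */
        printf("EC %d running slice %d\n", self->id, slice);
        sem_post(&self->done);    /* cooperative yield back to the dispatcher */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_ECS];

    for (int i = 0; i < NUM_ECS; i++) {
        ecs[i].id = i;
        sem_init(&ecs[i].run, 0, 0);
        sem_init(&ecs[i].done, 0, 0);
        pthread_create(&tid[i], NULL, ec_main, &ecs[i]);
    }

    /* The "VP": a trivial round-robin dispatcher that keeps at most one EC
     * runnable at a time by signaling exactly one semaphore and waiting for
     * that EC to yield before dispatching the next one. */
    for (int slice = 0; slice < 3; slice++)
        for (int i = 0; i < NUM_ECS; i++) {
            sem_post(&ecs[i].run);
            sem_wait(&ecs[i].done);
        }

    for (int i = 0; i < NUM_ECS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

In a real runtime, the dispatcher thread would also bind itself and its ECs to a specific core, which is the placement control the text refers to.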
D. Memory Management, Instrumentation and Profiling

The PCR contains a memory management subsystem where custom allocation strategies can be implemented. Both AMD's HyperTransport and Intel's QuickPath interconnects are tending toward a non-uniform memory access (NUMA) model. Both the memory allocator and the runtime need to be NUMA-aware for efficient execution on these architectures, such that memory regions are allocated local to the node the EC is executing on. Thread migration is another feature that can benefit from custom allocation strategies. As applications become more dynamic in nature, load imbalances across processors will become more pervasive. A common solution is thread/work migration to achieve load balance [9,17], where an isomalloc memory allocation strategy can be employed that allocates memory to each work unit/thread from mutually exclusive virtual memory regions. These examples point to the need for a modular memory management component with which to experiment with various memory allocation strategies.

The ability to capture performance metric data and profile various blocks of code is critical for the identification and resolution of performance bottlenecks. Fine-grained profiling available through the profiling layer allows capturing relevant data regarding memory allocation/deallocation for any given set of ECs to aid in postmortem analysis and debugging.

E. High-Performance Networking

The network component is highly threaded and completely integrated into the runtime. Independent communication progress with maximal overlap of computation and communication is achieved by having network ECs dedicated to communication. The scheduler can be configured either to cooperatively schedule all EC types (compute, network and I/O) across available VPs or to dedicate one or more VPs to communication progress. This provides the flexibility to allocate the required mix of resources for network- or compute-bound applications.

Phoenix provides a modular networking component to enable implementation of parallel programming models, such as MPI, that span nodes. The network component exposes an active messages (AM) API as well as native support for RDMA transfers. These communication semantics enable layering of the send/receive model of communication prevalent in MPI as well as the global shared memory model utilized by PGAS languages such as UPC [18] and CAF [15]. The current prototype supports InfiniBand and TCP, with work underway to support iWARP [19] for RDMA over Ethernet networks.

III. MESSAGE PASSING INTERFACE ON PHOENIX

MPI is the de facto programming model for parallel computing on distributed memory machines. We have developed an MPICH-based implementation of MPI over Phoenix, called MPICH-Phoenix. This is a proof-of-concept implementation to show the various advantages that Phoenix can bring to production-class parallel applications on CMP platforms.

By virtue of its design optimized for CMP, MPICH-Phoenix exploits CMP resources more effectively than mainstream MPI implementations [6,7], resulting in improved performance. The improvement in performance results primarily from high-performance shared memory communication, cache blocking and communication/computation overlap.

Before we explain these factors at length, we briefly describe a few important architectural and implementation details of MPICH-Phoenix. As mentioned earlier, each MPI task is mapped to a single EC in our implementation. Multiple MPI tasks may be active within a single process address space. This necessitates privatization of global variables to resolve collision of symbols among tasks executing within a single Phoenix process.
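The transcription does not show what the privatized code looks like, so the fragment below is a hypothetical before/after illustration of the general idea of global-variable privatization; it is not the output of the Elsa-based transformation, and phx_my_task_id() and MAX_TASKS_PER_PROCESS are assumed names.

/* Hypothetical illustration of global-variable privatization. */
#define MAX_TASKS_PER_PROCESS 64

/* Before the transformation (shown as a comment): one global shared, and
 * therefore corrupted, by every MPI task living in the same address space.
 *
 *   int iteration_count;
 *   void step(void) { iteration_count++; }
 */

/* Stub for illustration only: a real runtime would return the calling EC's
 * local task index. */
static int phx_my_task_id(void) { return 0; }

/* After the transformation: one copy of the former global per task,
 * selected by the task's index within the process. */
static int iteration_count[MAX_TASKS_PER_PROCESS];

void step(void)
{
    iteration_count[phx_my_task_id()]++;
}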

Currently, we use Elsa [3], a source-to-source translation tool for transforming C/C++ applications. The standard MPI compiler scripts encapsulate the privatization process, making it completely transparent to the user.

The existence of multiple MPI tasks within a single address space allows for true zero-copy communication, whereby an MPI sender can directly deposit a message into the receiver's buffer without the intermediary copies customary in mainstream MPIs. Moreover, the integrated scheduling and mapping of ECs available through Phoenix allows for collocation of tasks on the same or nearby physical sockets/cores. Section IV shows the bandwidth curve of MPICH-Phoenix compared to other MPI implementations to demonstrate this advantage.

Fine-grained parallelism and threading is another important feature of Phoenix. This allows us to spawn many more MPI tasks than the number of available processors with minimal overhead. This is commonly referred to as virtualization. The ratio of MPI tasks to processors is called the virtualization factor. Virtualized execution environments have been shown to provide several benefits [8]. A virtualized MPI implementation enables us to implement a data-driven model of execution. In this model, an MPI task has dedicated use of a processor until a communication operation invoked by it cannot be satisfied immediately. The runtime, using the specified scheduling strategy, selects an MPI task in a runnable state for execution on the blocking processor. This allows for automatic overlap of computation and communication, resulting in efficient utilization of CPU resources. Once the offending communication operation has completed, the original task is made runnable and available for scheduling.

Virtualization also allows cache blocking in applications. Cache blocking is an often-used strategy that exploits the memory hierarchy for higher performance. Cache-blocked algorithms attempt to increase cache reuse.

Optimized scheduling strategies for collectives that take into account the message transfer dependencies are being investigated.

IV. RESULTS

Our experimental testbed consists of the Abe [12] and QP [24] clusters at NCSA. Nodes on Abe consist of dual quad-core (8 processing elements) EM64T processors with a 4 MB L2 cache shared between each pair of cores. The QP cluster consists of dual dual-core (4 processing elements) Opteron processors with a 1 MB L2 cache per processor. A single data rate (SDR) InfiniBand network is employed between nodes. Intel C and C++ compilers were used for all tests.

A. MPICH-Phoenix Performance Analysis versus Selected MPI Implementations

In this subsection, we show the impact of context switch overhead, optimized intra-node communication and integrated scheduling using micro-benchmarks and the MILC QCD application. The performance of MPICH-Phoenix is compared against OpenMPI-1.2.1 [7], MVAPICH2-1.0.2p1 and adaptive MPI (AMPI) [9]. The former two are well-known open-source MPI implementations typically used on production systems. AMPI, built on the Charm++ [8] framework, makes for an interesting candidate for performance comparison since, like MPICH-Phoenix, it is also designed to execute MPI applications using processor oversubscription.

Fig. 2 shows the intra-node bandwidth for MPI tasks across all implementations on Abe. MPICH-Phoenix significantly outperforms the other implementations for large message sizes. OpenMPI and MVAPICH2 both utilize shared memory segments for communication within a node. This requires two memory copies between the source and destination process. Since MPICH-Phoenix tasks execute within the same address space, data is copied directly from the source to the destination buffer. Additionally, the Phoenix scheduler is able to co-schedule tasks on the same die to utilize the increased core-core bandwidth present on the chip. The peak bandwidth of 10 GB/sec is due to these tasks being co-scheduled on the same die, sharing the same L2 cache.

Figure 2. Intra-node bandwidth (MB/sec) on Abe.
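The exact micro-benchmark used to produce Fig. 2 is not listed in the transcription. The following is a minimal sketch of a standard ping-pong style bandwidth test between two MPI tasks, the kind of measurement such intra-node bandwidth curves are typically based on; it is not the benchmark code used in the paper.

/* Minimal ping-pong bandwidth sketch: rank 0 bounces a buffer off rank 1 and
 * reports an effective bandwidth per message size. Place both tasks on the
 * same node to measure intra-node bandwidth, e.g.: mpirun -np 2 ./bw */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_BYTES (4 * 1024 * 1024)
#define REPS      100

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(MAX_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int bytes = 1; bytes <= MAX_BYTES; bytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - t0;
        if (rank == 0)
            /* Two transfers of `bytes` per iteration; report MB/s. */
            printf("%8d bytes: %10.1f MB/s\n", bytes,
                   (2.0 * bytes * REPS) / (elapsed * 1e6));
    }

    MPI_Finalize();
    free(buf);
    return 0;
}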
