Parallelism On Supercomputers And The Message Passing Interface (MPI)

Transcription

Parallelism on Supercomputers and the Message Passing Interface (MPI)
Parallel Computing, CIS 410/510
Department of Computer and Information Science
Lecture 11 – Parallelism on Supercomputers

Outline
- Quick review of hardware architectures
- Running on supercomputers
- Message Passing
- MPI

Parallel Architecture Types
- Uniprocessor
  - Scalar processor
  - Vector processor
  - Single Instruction Multiple Data (SIMD)
- Shared Memory Multiprocessor (SMP)
  - Shared memory address space
  - Bus-based memory system
  - Interconnection network
(Figure: processor/memory diagrams for each architecture type.)

Parallel Architecture Types (2)
- Distributed Memory Multiprocessor
  - Message passing between nodes
  - Can also be regarded as MPP if processor number is large
- Massively Parallel Processor (MPP)
  - Many, many processors
- Cluster of SMPs
  - Shared memory addressing within SMP node
  - Message passing between SMP nodes
(Figure: nodes with processors (P), memories (M), and network interfaces joined by an interconnection network.)

Parallel Architecture Types (3)
- Multicore SMP+GPU Cluster
  - Shared memory addressing within SMP node
  - Message passing between SMP nodes
  - GPU accelerators attached (e.g., over PCI); "fused" processor+accelerator variants also shown
(Figure: multicore SMP nodes with attached GPU accelerators joined by an interconnection network.)

How do you get parallelism in the hardware?
- Instruction-Level Parallelism (ILP)
- Data parallelism
  - Increase amount of data to be operated on at the same time
- Processor parallelism
  - Increase number of processors
- Memory system parallelism
  - Increase number of memory units
  - Increase bandwidth to memory
- Communication parallelism
  - Increase amount of interconnection between elements
  - Increase communication bandwidth

Distributed Memory Parallelism
- Each processing element cannot access all data natively
- The scale can go up considerably
- The penalty for coordinating with other processing elements is now significantly higher
- Approaches change accordingly

Scientific Simulation and Supercomputers
- Why simulation?
  - Simulations are sometimes more cost effective than experiments
- Why extreme scale?
  - More compute cycles, more memory, etc., lead to faster and/or more accurate simulations
- Example domains: climate change, nuclear reactors, astrophysics
(Image credit: Prabhat, LBNL)

How big are supercomputers?
- Measured in "FLOPs": FLoating point Operations Per second
  - 1 GigaFLOP = 1 billion FLOPs
  - 1 TeraFLOP = 1000 GigaFLOPs
  - 1 PetaFLOP = 1000 TeraFLOPs (where we are today)
  - 1 ExaFLOP = 1000 PetaFLOPs (potentially arriving in 2018)

Distributed Memory Multiprocessors
- Each processor has a local memory
  - Physically separated memory address space
- Processors must communicate to access non-local data
  - Message communication (message passing)
  - Message passing architecture
  - Processor interconnection network
- Parallel applications must be partitioned across
  - Processors: execution units
  - Memory: data partitioning
- Scalable architecture
  - Small incremental cost to add hardware (cost of node)

Distributed Memory (MP) Architecture
- Nodes are complete computer systems
  - Including I/O
- Nodes communicate via interconnection network
  - Standard networks
  - Specialized networks
- Network interfaces
  - Communication integration
- Easier to build
(Figure: nodes, each with processor (P), memory (M), and a network interface, connected by a network.)

Performance Metrics: Latency and Bandwidth
- Bandwidth
  - Need high bandwidth in communication
  - Match limits in network, memory, and processor
  - Network interface speed vs. network bisection bandwidth
- Latency
  - Performance affected since processor may have to wait
  - Harder to overlap communication and computation
  - Overhead to communicate is a problem in many machines
- Latency hiding
  - Increases programming system burden
  - Examples: communication/computation overlap, prefetch
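As a rough first-order illustration (not from the slides), the time to send an n-byte message is often modeled as

    time(n) ≈ latency + n / bandwidth

so small messages are dominated by latency and large messages by bandwidth, which is why both latency hiding and message aggregation matter.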

Advantages of Distributed Memory Architectures
- The hardware can be simpler (especially versus NUMA) and is more scalable
- Communication is explicit and simpler to understand
- Explicit communication focuses attention on the costly aspect of parallel computation
- Synchronization is naturally associated with sending messages, reducing the possibility for errors introduced by incorrect synchronization
- Easier to use sender-initiated communication, which may have some advantages in performance

Outline
- Quick review of hardware architectures
- Running on supercomputers
  - The purpose of these slides is to give context, not to teach you how to run on supercomputers
- Message Passing
- MPI

Running on Supercomputers
- Sometimes one job runs on the entire machine, using all processors
  - These are called "hero runs"
- Sometimes many smaller jobs are running on the machine
- For most supercomputers, the processors are being used nearly continuously
- The processors are the "scarce resource" and jobs to run on them are "plentiful"

Running on Supercomputers
- You plan a "job" you want to run
  - The job consists of a parallel binary program and an "input deck" (something that specifies input data for the program)
- You submit that job to a "queue"
- The job waits in the queue until it is scheduled
- The scheduler allocates resources when (i) resources are available and (ii) the job is deemed "high priority"

Running on Supercomputers
- The scheduler runs scripts that initialize the environment
  - Typically done with environment variables
- At the end of initialization, it is possible to infer:
  - What the desired job configuration is (i.e., how many tasks per node)
  - What other nodes are involved
  - How your node's tasks relate to the overall program
- The MPI library knows how to interpret all of this information and hides the details from you

UO’s supercomputer: ACISS
(Figure slide; no transcribable text.)

Job submission on ACISS
(Figure slide; no transcribable text.)

Job submission on ACISS (continued)
(Figure slide; no transcribable text.)

Outline
- Quick review of hardware architectures
- Running on supercomputers
- Message Passing
- MPI

Acknowledgements and Resources
- Portions of the lecture slides were adopted from:
  - Argonne National Laboratory, MPI tutorials
  - Lawrence Livermore National Laboratory, MPI tutorials
  - See online tutorial links on the course webpage
- W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, MIT Press, ISBN 0-262-57133-1, 1999.
- W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message Passing Interface, MIT Press, ISBN 0-262-57132-3, 1999.

Types of Parallel Computing Models
- Data parallel
  - Simultaneous execution on multiple data items
  - Example: Single Instruction, Multiple Data (SIMD)
- Task parallel
  - Different instructions on different data (MIMD)
- SPMD (Single Program, Multiple Data)
  - Combination of data parallel and task parallel
  - Not synchronized at individual operation level
- Message passing is for MIMD/SPMD parallelism
  - Can be used for data parallel programming

The Message-Passing Model
- A process is a program counter and address space
- Processes can have multiple threads (program counters and associated stacks) sharing a single address space
- MPI is for communication among processes
  - Not threads
- Interprocess communication consists of
  - Synchronization
  - Data movement
(Figure: processes P1–P4, each with one or more threads sharing an address space (memory).)

SPMD
- Data distributed across processes
  - Not shared
- "Owner compute" rule: the process that "owns" the data (local data) performs computations on that data
(Figure: a shared program operating on multiple data.)
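A minimal sketch (not from the slides) of the owner-compute idea in C with MPI: each rank owns a contiguous block of a hypothetical global index space and updates only that block. The block-partitioning arithmetic and the value computed are purely illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define GLOBAL_N 1000   /* hypothetical global problem size */

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block-partition the global index space; this rank "owns" [lo, hi) */
    int chunk = (GLOBAL_N + size - 1) / size;
    int lo = rank * chunk;
    int hi = (lo + chunk > GLOBAL_N) ? GLOBAL_N : lo + chunk;
    int nlocal = (hi > lo) ? hi - lo : 0;

    double *local = malloc((nlocal > 0 ? nlocal : 1) * sizeof(double));

    /* Owner-compute: every rank updates only the data it owns */
    for (int i = 0; i < nlocal; i++)
        local[i] = 2.0 * (lo + i);

    printf("rank %d of %d owns global indices [%d, %d)\n", rank, size, lo, hi);

    free(local);
    MPI_Finalize();
    return 0;
}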

Message Passing Programming
- Defined by communication requirements
  - Data communication (necessary for algorithm)
  - Control communication (necessary for dependencies)
- Program behavior determined by communication patterns
- Message passing infrastructure attempts to support the forms of communication most often used or desired
  - Basic forms provide functional access
    - Can be used most often
  - Complex forms provide higher-level abstractions
    - Serve as basis for extension
    - Example: graph libraries, meshing libraries, ...
  - Extensions for greater programming power

Communication Types
- Two ideas for communication
  - Cooperative operations
  - One-sided operations

Cooperative Operations for Communication
- Data is cooperatively exchanged in message-passing
- Explicitly sent by one process and received by another
- Advantage of local control of memory
  - Any change in the receiving process's memory is made with the receiver's explicit participation
- Communication and synchronization are combined
(Figure: Process 0 calls Send(data), Process 1 calls Receive(data); time flows downward.)

One-Sided Operations for Communication
- One-sided operations between processes
  - Include remote memory reads and writes
- Only one process needs to explicitly participate
  - There is still agreement implicit in the SPMD program
- Advantages?
  - Communication and synchronization are decoupled
(Figure: Process 0 performs Put/Get directly against Process 1's memory.)
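A minimal MPI-2 one-sided sketch (not from the slides): rank 0 puts a value directly into a window exposed by rank 1, which makes no matching call. Window creation with fence synchronization is one common pattern; the value and ranks are illustrative, and the program assumes at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double buf = 0.0;          /* memory this rank exposes for remote access */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes its buf in a window */
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open access epoch */
    if (rank == 0) {
        double val = 3.14;
        /* write val into rank 1's buf; rank 1 does not call receive */
        MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                 /* close epoch; data now visible */

    if (rank == 1)
        printf("rank 1 received %f via MPI_Put\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}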

Pairwise vs. Collective Communication
- Communication between process pairs
  - Send/Receive or Put/Get
  - Synchronous or asynchronous (we'll talk about this later)
- Collective communication between multiple processes
  - Process group (collective)
    - Several processes logically grouped together
  - Communication within group
  - Collective operations
    - Communication patterns: broadcast, multicast, subset, scatter/gather, ...
    - Reduction operations

Outline
- Quick review of hardware architectures
- Running on supercomputers
- Message Passing
- MPI

What is MPI (Message Passing Interface)?
- Message-passing library (interface) specification
  - Extended message-passing model
  - Not a language or compiler specification
  - Not a specific implementation or product
- Targeted for parallel computers, clusters, and NOWs
  - NOWs = networks of workstations
- Specified in C, C++, Fortran 77, F90
- Full-featured and robust
- Designed to access advanced parallel hardware
- For end users, library writers, tool developers
- Message Passing Interface (MPI) Forum: see docs/docs.html on the MPI Forum website

Why Use MPI?
- Message passing is a mature parallel programming model
  - Well understood
  - Efficient match to hardware (interconnection networks)
  - Many applications
- MPI provides a powerful, efficient, and portable way to express parallel programs
- MPI was explicitly designed to enable libraries...
  - ...which may eliminate the need for many users to learn (much of) MPI
- Need standard, rich, and robust implementation
- Three versions: MPI-1, MPI-2, MPI-3 (just released!)
  - Robust implementations including free MPICH (ANL)

Features of MPI
- General
  - Communicators combine context and group for security
  - Thread safety (implementation dependent)
- Point-to-point communication
  - Structured buffers and derived datatypes, heterogeneity
  - Modes: normal, synchronous, ready, buffered
- Collective
  - Both built-in and user-defined collective operations
  - Large number of data movement routines
  - Subgroups defined directly or by topology

Features of MPI (continued)
- Application-oriented process topologies
  - Built-in support for grids and graphs (based on groups)
- Profiling
  - Hooks allow users to intercept MPI calls
  - Interposition library interface (PMPI)
  - Many tools (e.g., TAU) use PMPI
- Environmental
  - Inquiry
  - Error control

Is MPI Large or Small?
- MPI is large
  - MPI-1 is 128 functions, MPI-2 is 152 functions
  - Extensive functionality requires many functions
  - Not necessarily a measure of complexity
- MPI is small (6 functions)
  - Many parallel programs use just 6 basic functions
- "MPI is just right," said Baby Bear
  - One can access flexibility when it is required
  - One need not master all parts of MPI to use it

To use or not use MPI? That is the question.
- USE
  - You need a portable parallel program
  - You are writing a parallel library
  - You have irregular or dynamic data relationships that do not fit a data parallel model
  - You care about performance and have to do Exercise 1
- NOT USE
  - You don't need parallelism at all (Ha!)
  - You can use libraries (which may be written in MPI)
  - You can use multi-threading in a concurrent environment

Getting Started
- Writing MPI programs
- Compiling and linking
- Running MPI programs

A Simple MPI Program (C)

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    MPI_Init( &argc, &argv );
    printf( "Hello, world!\n" );
    MPI_Finalize();
    return 0;
}

- What does this program do?

A Simple MPI Program (C++)

#include <iostream>
#include "mpi.h"

int main( int argc, char *argv[] )
{
    MPI::Init( argc, argv );
    std::cout << "Hello, world!" << std::endl;
    MPI::Finalize();
    return 0;
}

A Minimal MPI Program (Fortran)

program main
use MPI
integer ierr

call MPI_INIT( ierr )
print *, 'Hello, world!'
call MPI_FINALIZE( ierr )
end

MPI_Init
- What happens during MPI initialization?
  - Think about it
- How do hardware resources get allocated?
  - Hmm, is this part of MPI?
- How do processes on different nodes get started?
  - Where does their executable program come from?
- What do the processes need to know?
- What about OS resources?
- What about tools that are running with MPI?

MPI_Finalize
- Why do we need to finalize MPI?
- What happens during MPI finalization?
  - Think about it
- What is necessary for a "graceful" MPI exit?
  - Can bad things happen otherwise?
  - Suppose one process exits?
- How do resources get de-allocated?
- What about communications?
- What type of exit protocol might be used?
- What about tools?

Notes on C and Fortran
- C and Fortran library bindings correspond closely
- In C:
  - mpi.h must be #included
  - MPI functions return error codes or MPI_SUCCESS
- In Fortran:
  - mpif.h must be included, or use the MPI module (MPI-2)
  - All MPI calls are to subroutines, with a place for the return code in the last argument
- C++ bindings, and Fortran-90 issues, are part of MPI-2

Error Handling
- By default, an error causes all processes to abort
- The user can cause routines to return (with an error code)
  - In C++, exceptions are thrown (MPI-2)
- A user can also write and install custom error handlers
- Libraries may handle errors differently from applications
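A small sketch (not from the slides) of switching MPI_COMM_WORLD from the default abort-on-error behavior to returning error codes, then checking one. MPI_Comm_set_errhandler, MPI_ERRORS_RETURN, and MPI_Error_string are standard MPI; the deliberately bad call is just for illustration.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Default is MPI_ERRORS_ARE_FATAL (abort); switch to returning codes */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Deliberately bad call: a negative count should now return an error code */
    int dummy = 0;
    int err = MPI_Send(&dummy, -1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(err, msg, &len);
        printf("MPI_Send failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}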

Running MPI Programs
- MPI-1 does not specify how to run an MPI program
- Starting an MPI program is dependent on the implementation
  - Scripts, program arguments, and/or environment variables
- For MPICH under Linux:
  - % mpirun -np <procs> a.out
- mpiexec <args>
  - Part of MPI-2, as a recommendation
  - mpiexec for MPICH (distribution from ANL)
  - mpirun for SGI's MPI
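As an illustration (exact commands vary by installation), a typical MPICH-style workflow uses the mpicc compiler wrapper and the mpiexec launcher:

% mpicc hello.c -o hello        (compile and link against the MPI library)
% mpiexec -n 4 ./hello          (launch 4 MPI processes)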

Finding Out About the Environment
- Two important questions that arise in message passing:
  - How many processes are being used in the computation?
  - Which one am I?
- MPI provides functions to answer these questions
  - MPI_Comm_size reports the number of processes
  - MPI_Comm_rank reports the rank
    - A number between 0 and size-1
    - Identifies the calling process

Better "Hello World" (C)

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    printf( "I am %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}

- What does this program do, and why is it better?

MPI Basic Send/Receive
- We need to fill in the details in:
  (Figure: Process 0 calls Send(data), Process 1 calls Receive(data); time flows downward.)
- Things that need specifying:
  - How will "data" be described?
  - How will "processes" be identified?
  - How will the receiver recognize/screen messages?
  - What will it mean for these operations to complete?

What is message passing?
- Data transfer plus synchronization
- Requires cooperation of sender and receiver
- Cooperation not always apparent in code
(Figure: data transferred from sender to receiver over time, with the receiver signaling readiness.)

Some Basic Concepts
- Processes can be collected into groups
- Each message is sent in a context
  - Must be received in the same context!
- A group and context together form a communicator
- A process is identified by its rank
  - With respect to the group associated with a communicator
- There is a default communicator: MPI_COMM_WORLD
  - Contains all initial processes

MPI Datatypes
- Message data (sent or received) is described by a triple:
  - address, count, datatype
- An MPI datatype is recursively defined as:
  - A predefined data type from the language
  - A contiguous array of MPI datatypes
  - A strided block of datatypes
  - An indexed array of blocks of datatypes
  - An arbitrary structure of datatypes
- There are MPI functions to construct custom datatypes
  - Array of (int, float) pairs
  - Row of a matrix stored columnwise
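A small sketch (not from the slides) of a derived datatype: MPI_Type_vector describes one row of a column-major matrix as a strided block, so the whole row can be sent with a single call. The 4x4 matrix, its values, and the two ranks used are illustrative; run with at least two processes.

#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char *argv[])
{
    int rank;
    double a[N][N];            /* treated as column-major: a[col][row] */
    MPI_Datatype rowtype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One "row" of a column-major N x N matrix: N blocks of 1 element, stride N */
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &rowtype);
    MPI_Type_commit(&rowtype);

    if (rank == 0) {
        for (int c = 0; c < N; c++)
            for (int r = 0; r < N; r++)
                a[c][r] = 10.0 * r + c;
        /* send row 0: starts at a[0][0], elements N doubles apart in memory */
        MPI_Send(&a[0][0], 1, rowtype, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double row[N];          /* received as N contiguous doubles */
        MPI_Recv(row, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got row 0: %g %g %g %g\n", row[0], row[1], row[2], row[3]);
    }

    MPI_Type_free(&rowtype);
    MPI_Finalize();
    return 0;
}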

MPI Tags
- Messages are sent with an accompanying user-defined integer tag
  - Assists the receiving process in identifying the message
- Messages can be screened at the receiving end by specifying a specific tag
  - MPI_ANY_TAG matches any tag in a receive
- Tags are sometimes called "message types"
  - MPI calls them "tags" to avoid confusion with datatypes

MPI Basic (Blocking) Send

MPI_SEND(start, count, datatype, dest, tag, comm)

- The message buffer is described by:
  - start, count, datatype
- The target process is specified by dest
  - Rank of the target process in the communicator specified by comm
- Process blocks until:
  - Data has been delivered to the system
  - Buffer can then be reused
- Message may not have been received by the target process!
  (A combined send/receive sketch appears after the receive slide below.)

MPI Basic (Blocking) Receive

MPI_RECV(start, count, datatype, source, tag, comm, status)

- Process blocks (waits) until:
  - A matching message is received from the system
    - Matches on source and tag
  - Buffer must be available
- source is a rank in the communicator specified by comm
  - Or MPI_ANY_SOURCE
- status contains further information
- Receiving fewer than count elements is OK; receiving more is not
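A minimal sketch (not from the slides) combining the two calls above: rank 0 sends an array of ints to rank 1, which receives them with a matching source, tag, and communicator. Run with at least two processes; the tag value 99 and the data are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, data[4] = {0};
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 4; i++) data[i] = i * i;
        /* start=data, count=4, datatype=MPI_INT, dest=1, tag=99, comm */
        MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocks until a message with matching source and tag arrives */
        MPI_Recv(data, 4, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        printf("rank 1 received: %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}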

Retrieving Further Information
- status is a data structure allocated in the user's program
- In C:

int recvd_tag, recvd_from, recvd_count;
MPI_Status status;
MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
recvd_tag  = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count( &status, datatype, &recvd_count );

Why Datatypes?
- All data is labeled by type in MPI
- Enables heterogeneous communication
  - Supports communication between processes on machines with different memory representations and lengths of elementary datatypes
  - MPI provides the representation translation if necessary
- Allows application-oriented layout of data in memory
  - Reduces memory-to-memory copies in the implementation
  - Allows use of special hardware (scatter/gather)

Tags and Contexts
- Separation of messages by use of tags
  - Requires libraries to be aware of tags of other libraries
  - This can be defeated by use of "wild card" tags
- Contexts are different from tags
  - No wild cards allowed
  - Allocated dynamically by the system when a library sets up a communicator for its own use
- User-defined tags still provided in MPI
  - For user convenience in organizing the application
- Use MPI_Comm_split to create new communicators
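A small sketch (not from the slides) of MPI_Comm_split: ranks are split into "even" and "odd" sub-communicators, each with its own context and its own ranks starting at 0. The color/key choices here are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int world_rank, sub_rank, sub_size;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* color selects the group (even vs. odd ranks); key orders ranks within it */
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);

    MPI_Comm_rank(subcomm, &sub_rank);
    MPI_Comm_size(subcomm, &sub_size);
    printf("world rank %d -> rank %d of %d in the %s communicator\n",
           world_rank, sub_rank, sub_size, color ? "odd" : "even");

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}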

Programming MPI with Only Six Functions
- Many parallel programs can be written using just:
  - MPI_INIT()
  - MPI_FINALIZE()
  - MPI_COMM_SIZE()
  - MPI_COMM_RANK()
  - MPI_SEND()
  - MPI_RECV()
- What might be not so great with this?
  - Point-to-point (send/recv) isn't the only way...
  - Add more support for communication

Introduction to Collective Operations in MPI
- Called by all processes in a communicator
- MPI_BCAST
  - Distributes data from one process (the root) to all others
- MPI_REDUCE
  - Combines data from all processes in the communicator
  - Returns it to one process
- In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency
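A short sketch (not from the slides): the root broadcasts a value to every rank, each rank computes a local contribution, and MPI_Reduce sums the contributions back at the root. The per-rank computation is purely illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double scale = 0.0, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) scale = 2.5;                               /* only the root knows the value */
    MPI_Bcast(&scale, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);      /* now every rank does */

    local = scale * rank;                                     /* each rank's contribution */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %g\n", size, total);

    MPI_Finalize();
    return 0;
}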

Summary
- The parallel computing community has cooperated on the development of a standard for message passing libraries
- There are many implementations, on nearly all platforms
- MPI subsets are easy to learn and use
- Lots of MPI material is available
