FORTRAN And MPI Message Passing Interface (MPI)

Transcription

FORTRAN and MPI: Message Passing Interface (MPI), Day 3
Maya Neytcheva, IT, Uppsala University, maya@it.uu.se

Course plan:
- MPI - general concepts
- Communications in MPI
  – Point-to-point communications
  – Collective communications
- Parallel debugging
- Advanced MPI: user-defined data types, functions
  – Linear algebra operations
- Advanced MPI: communicators, virtual topologies
  – Parallel sort algorithms
- Parallel performance. Summary. Tendencies.

Memory organization for multiprocessor systems

Computer memories are known to be never big enough for what you want to do.
The hardship of the anonymous programmer. (Year unknown.)

Two major solutions have been offered:
- a shared memory model, combined with the so-called interleaved storage (Cray-1, CDC Cyber 205, Fujitsu, NEC SX, SGI Power Challenge, Tera computers, etc.),
- a distributed memory model (CM-2/200, Intel Paragon, Intel iPSC/860, MasPar, nCUBE2, Cray T3D/T3E, etc.).

All memory systems are, in addition, hierarchical: each memory system can be viewed as a complex of several types of memories with different capacity, response time and price.

Memory organization

Computer memories are characterized by:
- capacity (size): measured in bytes (B); ranges from several tens of B for registers up to hundreds of GB for disks;
- functional design: hierarchy, protection;
- response time:
  – access time (also called latency) - the time needed for the memory to respond to a read or write request;
  – memory cycle time - the minimum period between two successive requests to the memory;
  – memory bandwidth - the rate at which data can be transferred between memory and CPU.

Memory organization
Re: memory cycle time

For many reasons, if a memory has, say, an 80 ns response time, it cannot be accessed every 80 ns. Furthermore, the cycle time can be longer than the access time. The latter can, for example, be a consequence of the so-called destructive read, which means that after reading, the contents of the memory cell are destroyed and must be recovered afterwards, which costs time.

Memory organization
Re: memory latency - the most difficult issue

The access time per word varies from 50 ns for the chips in today's personal computers to 10 ns or even less for cache memories. It includes the time to select the right memory chip (among several hundred), but also the time spent waiting for the bus to finish a previous transaction before the memory request is initialized. Only after that can the contents of the memory be sent along the bus (or via some other interconnection) to the registers.

Memory organization
The latency problem

The larger the latency, the slower the memory. For certain RAM chips, called dynamic RAMs, with a latency of about 50 ns, we could have 1 access per 50 nanoseconds, i.e., 20 million accesses per second. But typical desktop computers have a memory bus speed of 100 MHz, or 100 million memory accesses per second. How can we resolve this disparity between the memory latency and the bus speed?

Memory organization

A bus is simply a circuit that connects one part of the motherboard to another. The more data a bus can handle at one time, the faster it allows information to travel. The speed of the bus, measured in megahertz (MHz), refers to how much data can move across the bus simultaneously. Bus speeds can range from 66 MHz to over 800 MHz and can dramatically affect a computer's performance.
The above characteristics have changed slightly since, but the principle remains.

Hierarchical structure of the memory

[Figure: the memory hierarchy - registers, caches, main memory, secondary storage; capacity and price decrease while access time grows as one moves away from the processor.]

The memory closest to the processor registers is known as cache. It was introduced in 1968 by IBM for the IBM System/360 Model 85. Caches are intended to contain the most recently used blocks of the main memory, following the principle "the more frequently data are addressed, the faster the access should be".

Hierarchical structure of the memory

Keywords:
- cache lines
- cache hit
- cache miss
- replacement strategy: FIFO, LRU, ...
- hit rate: the percentage of requests that result in hits; it depends on the memory organization and the replacement strategy
- cache coherency: not considered a severe issue for distributed memory computers

Memory design
Non-Uniform Memory Architecture - NUMA
cache-coherent NUMA - ccNUMA

A computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor.

Memory design
NEC SX-5: Multi Node NUMA Memory

The main memory configuration of SX-5M Multi Node systems includes both shared and NUMA architecture. Each node has full-performance access to its entire local memory, and consistent but reduced-performance access to the memory of all other nodes.
Access to other nodes is performed through the IXS Internode Crossbar Switch, which provides page translation tables across nodes and synchronization registers, and enables global data movement instructions as well as numerous cross-node instructions. Memory addresses include the node number (as do CPU and IOP identifiers).
Latencies for internode NUMA-level memory access are lower than in most workstation-technology NUMA implementations, and the 8 gigabyte-per-second bandwidth of just a single IXS channel exceeds the entire memory bandwidth of SMP-class systems. Even so, because the internode startup latencies are greater than those of local memory accesses, internode NUMA access is best utilized by block transfers of data rather than by the transfer of single data elements.
This architecture, introduced with the SX-4 series, has been popularized by many products and lends itself to a combination of traditional parallel vector processing (microtasking and macrotasking) with message passing (MPI). Message passing alone is also highly efficient on the architecture.

Memory design
ALLCACHE KSR1: good ideas which did not survive

The system hardware which allows the user to view the distributed memories, local to each processor, as a global shared address space, is the patented ALLCACHE memory management system. It provides the programmer with a uniform 2^64-byte address space for instructions and data, called the System Virtual Address space (SVA). The contents of SVA locations reside in physically distributed memories, called local caches, each with a capacity of 2^25 bytes = 32 MB. These local caches are second-level caches, to distinguish them from the first-level caches, which are the standard instruction and data caches. The data is stored in pages and subpages. Each local cache has 2^11 = 2048 pages, each of 128 subpages of 2^7 = 128 B. The unit of data which is exchanged is a subpage.
Consistency is automatically maintained by taking into account the type of memory reference made by the processors - read-only or modify. If the data is only read, the processor which has initiated a memory request will receive a copy of it and its address. If a modification is going to take place, the processor will receive the only instance of an address and its data.

Memory design: KSR1: The ring of rings

[Figure: a hierarchy of rings; processing nodes with local level-2 caches sit on SE:0 rings, which are connected by ALLCACHE routing and directory cells to the SE:1 ring.]

The memory management is done by the so-called ALLCACHE Engines. The engines form a hierarchy of a potentially unlimited number of levels. The maximal length of the path to fulfill any memory request is O(log p), where p is the number of processors. One can view the hardware architecture as a fat tree, consisting of rings (called SE:0, SE:1, etc.), along which the search for a copy of a requested data item proceeds. Each SE:0 ring connects 32 PEs. If the data is not found in the local SE:0 ring, to which the processor that issued the request belongs, the request is sent to the upper ring SE:1, etc. Each level has a growing bandwidth: 1 GB/s for SE:0, and 1, 2 or 4 GB/s for SE:1, and so on.

The BlueGene

System parameters:

                                        BlueGene/L           BlueGene/P
  Clock cycle                           700 MHz              850 MHz
  Theor. peak performance
    per proc. (64-bit)                  2.8 Gflop/s          3.4 Gflop/s
    maximal                             367/183.5 Tflop/s    1.5/3 Pflop/s
  Main memory
    memory/card                         512 MB               2 GB
    memory/maximal                      16 TB                442 TB
  No. of processors                     2 x 65,536           4 x 221,184
  Communication bandwidth
    point-to-point (3-D torus)          175 MB/s             350 MB/s
    point-to-point (tree network)       350 MB/s             700 MB/s

The BlueGene

The BlueGene/L possesses no fewer than 5 networks, 2 of which are of interest for inter-processor communication: a 3-D torus network and a tree network. The torus network is used for most general communication patterns. The tree network is used for frequently occurring collective communication patterns like broadcasting, reduction operations, etc. The hardware bandwidth of the tree network is twice that of the torus: 350 MB/s against 175 MB/s per link.

Cache-aware algorithms

Block versions of various algorithms, combined with a proper data structure.
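As an illustration of the blocking idea (a minimal C sketch, not from the slides; the block size BS and the row-major storage are assumptions made for the example), a cache-blocked matrix-matrix multiplication:

  /* Cache-blocked C += A*B for n x n row-major matrices.
     BS is an assumed block size, chosen so that three BS x BS blocks
     fit in cache; it has to be tuned for the actual cache sizes. */
  #define BS 64

  void matmul_blocked(int n, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += BS)
          for (int kk = 0; kk < n; kk += BS)
              for (int jj = 0; jj < n; jj += BS)
                  /* multiply the (ii,kk) block of A by the (kk,jj) block of B */
                  for (int i = ii; i < n && i < ii + BS; i++)
                      for (int k = kk; k < n && k < kk + BS; k++) {
                          double aik = A[i*n + k];
                          for (int j = jj; j < n && j < jj + BS; j++)
                              C[i*n + j] += aik * B[k*n + j];
                      }
  }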

MPI: advanced features
User-defined (derived) datatypes

Derived datatypes
Grouping data for communications

Re: the count and datatype parameters in MPI_Send and MPI_Recv.
Imagine we have to send three variables from one PE to another:
  float a
  float b
  int   n
The communication can still take place if we provide not all addresses, but only
- the address of a,
- the relative displacements of b and n.

Derived datatypes

[Figure: a stored at address 24, b at address 40, n at address 48.]

Displacement of b: 40 - 24 = 16
Displacement of n: 48 - 24 = 24

Derived datatypes

Provide now the following information to the communication system:
- There are three elements to be transferred.
- The first is a float, the second is a float, the third is an int.
- In order to find them:
  – the first is displaced 0 bytes from the beginning of the message,
  – the second is displaced 16 bytes from the beginning of the message,
  – the third is displaced 24 bytes from the beginning of the message.
- The beginning of the message has address a.

Derived datatypes

The basic principle behind MPI's derived datatypes is: in a new MPI datatype, provide all of the information except the beginning address.
A general MPI datatype is a sequence of pairs

  {(t0, d0), (t1, d1), ..., (t_{n-1}, d_{n-1})}        (1)

where t_k is a basic MPI datatype and d_k is a displacement in bytes.
For the above example:
  {(MPI_FLOAT,0), (MPI_FLOAT,16), (MPI_INT,24)}
Type map: the sequence (1).
Extent: the span from the first byte to the last byte occupied by entries in the datatype, rounded up to satisfy alignment requirements.
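A minimal C sketch (not from the slides) of how such a type map can be built in practice; the struct name Msg and the helper build_msg_type are made up for the illustration, and MPI_Type_create_struct is used as the modern name of MPI_TYPE_STRUCT:

  #include <mpi.h>

  struct Msg { float a; float b; int n; };   /* actual displacements depend on the compiler */

  /* Builds a derived datatype with type map
     {(MPI_FLOAT,d_a),(MPI_FLOAT,d_b),(MPI_INT,d_n)}, the displacements being
     measured from the beginning of the struct. */
  MPI_Datatype build_msg_type(struct Msg *m)
  {
      int          blocklens[3] = {1, 1, 1};
      MPI_Datatype types[3]     = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
      MPI_Aint     addr[3], base, displs[3];
      MPI_Datatype newtype;

      MPI_Get_address(&m->a, &addr[0]);
      MPI_Get_address(&m->b, &addr[1]);
      MPI_Get_address(&m->n, &addr[2]);
      MPI_Get_address(m, &base);
      for (int i = 0; i < 3; i++)
          displs[i] = addr[i] - base;       /* relative displacements, as on the slide */

      MPI_Type_create_struct(3, blocklens, displs, types, &newtype);
      MPI_Type_commit(&newtype);
      return newtype;                       /* usable as MPI_Send(m, 1, newtype, ...) */
  }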

Derived datatypes

Example: Consider Type = {(double, 0), (char, 8)}. Assume that doubles have to be aligned at addresses that are multiples of 8. The extent of this type is 16. The extent will be the same for Type = {(char, 0), (double, 1)}.

User-defined datatypes and packing

MPI provides a mechanism to build general user-defined datatypes. However, the construction phase of these is expensive! Thus, an application should use them many times to amortize the 'build' cost.

User-defined datatypes and packing

Example: We want to send n contiguous elements, all of the same type. Consider an array a(n,n). Say, we want to communicate a row in C and a column in Fortran.

[Figure: a row of a in C and a column of a in Fortran occupy contiguous memory.]

User-defined datatypes and packing

MPI_TYPE_CONTIGUOUS(count, oldtype, newtype)

[Figure: count = 3 copies of oldtype concatenated into newtype.]
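A small self-contained C sketch (an illustration, not from the slides: two processes, N = 4 assumed) of sending one row of a C array with MPI_Type_contiguous:

  #include <mpi.h>
  #define N 4

  int main(int argc, char **argv)
  {
      double a[N][N];
      MPI_Datatype rowtype;
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* N contiguous doubles = one row of the C (row-major) array */
      MPI_Type_contiguous(N, MPI_DOUBLE, &rowtype);
      MPI_Type_commit(&rowtype);

      if (rank == 0) {
          for (int j = 0; j < N; j++) a[2][j] = j;             /* fill row 2 */
          MPI_Send(&a[2][0], 1, rowtype, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&a[2][0], 1, rowtype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      MPI_Type_free(&rowtype);
      MPI_Finalize();
      return 0;
  }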

User-defined datatypes and packing

Example: Now, for the same array a(n,n), we want to communicate a row in Fortran and a column in C. In both cases the elements are not contiguous, but have a constant stride.

[Figure: a row of a in Fortran and a column of a in C - strided memory access.]

User-defined datatypes and packing

MPI_TYPE_VECTOR(count, blocklength, stride, oldtype, newtype)

[Figure: count = 3, blklen = 2, stride = 3.]

For example, MPI_TYPE_VECTOR(n, 1, n, MPI_DOUBLE, MPI_VDOUBLE) describes n blocks of one element each with stride n, i.e., a row of the Fortran array (or a column of the C array).
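The corresponding sketch for the strided case (again a hypothetical two-process example with N = 4, not from the slides), sending one column of a C array with MPI_Type_vector:

  #include <mpi.h>
  #define N 4

  int main(int argc, char **argv)
  {
      double a[N][N];
      MPI_Datatype coltype;
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* one column of a row-major N x N array:
         N blocks of 1 double, stride N elements between block starts */
      MPI_Type_vector(N, 1, N, MPI_DOUBLE, &coltype);
      MPI_Type_commit(&coltype);

      if (rank == 0) {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++) a[i][j] = 10.0*i + j;
          MPI_Send(&a[0][1], 1, coltype, 1, 0, MPI_COMM_WORLD);      /* column 1 */
      } else if (rank == 1) {
          MPI_Recv(&a[0][1], 1, coltype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      MPI_Type_free(&coltype);
      MPI_Finalize();
      return 0;
  }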

User-defined datatypes and packing

Example: Consider an already defined datatype oldtype which has type map {(double, 0), (char, 8)}.
A call to
  MPI_Type_vector(2, 3, 4, oldtype, newtype)
will create the type with type map
  {(double,0),(char,8),(double,16),(char,24),(double,32),(char,40),
   (double,64),(char,72),(double,80),(char,88),(double,96),(char,104)},
i.e., two blocks with three copies of the old type each, and a stride of 4 elements (4 x 16 = 64 bytes) between the starts of consecutive blocks.

User-defined datatypes and packing

Example: Consider again oldtype with type map {(double, 0), (char, 8)}.
A call to
  MPI_Type_vector(3, 1, -2, oldtype, newtype)
will create the type with type map
  {(double,0),(char,8),(double,-32),(char,-24),(double,-64),(char,-56)}.

User-defined datatypes and packing

Another scenario: communicate the upper triangular part of a square matrix.

[Figure: the upper triangular part, with row lengths n, n-1, n-2, ..., 1.]

User-defined datatypes and packing

MPI_TYPE_INDEXED(count, array_of_blklen, array_of_displ, oldtype, newtype)

[Figure: count = 3, blklen = (2,3,1), displ = (0,3,8).]

User-defined datatypes and packing

  for (i = 0; i < n; i++) {
      blk_len[i] = n - i;
      displ[i]   = (n+1)*i;
  }
  MPI_Type_indexed(n, blk_len, displ, MPI_FLOAT, &MPI_U);
  MPI_Type_commit(&MPI_U);
  if (my_rank == 0)
      MPI_Send(A, 1, MPI_U, 1, 0, MPI_COMM_WORLD);
  else /* my_rank == 1 */
      MPI_Recv(T, 1, MPI_U, 0, 0, MPI_COMM_WORLD, &status);

OBS: the type-vector constructor MPI_Type_vector(...) is inapplicable here, since the lengths of the portions are different.

User-defined datatypes and packing

MPI_TYPE_HVECTOR(count, blklen, stride, oldtype, newtype)

[Figure: count = 3, blklen = 2, stride = 7.]

Identical to MPI_TYPE_VECTOR, however the stride is given in bytes and not in elements. 'H' stands for homogeneous. Allows for specifying overlapping entries.

User-defined datatypes and packing

Example: (We have assumed that 'double' cannot start at a displacement such as 4.) Consider the already defined datatype oldtype which has type map {(double, 0), (char, 8)}.
A call to
  MPI_Type_hvector(2, 3, 4, oldtype, newtype)
will create the type with type map
  {(double,0),(char,8),(double,16),(char,24),(double,32),(char,40),
   (double,4),(char,12),(double,20),(char,28),(double,36),(char,44)},
since the stride is now 4 bytes, so the second block starts at byte 4.

User-defined datatypes

Example: Task: transpose a matrix.

      REAL a(100,100), b(100,100)
      INTEGER disp(2), blocklen(2), type(2), row, row1, sizeofreal
      INTEGER myrank, ierr
      INTEGER status(MPI_STATUS_SIZE)

      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

C     transpose matrix a onto b

      CALL MPI_TYPE_EXTENT(MPI_REAL, sizeofreal, ierr)

C     create datatype for one row
C     (vector with 100 real entries and stride 100)
      CALL MPI_TYPE_VECTOR(100, 1, 100, MPI_REAL, row, ierr)

User-defined datatypes

Example: Transpose a matrix (cont.):

C     create datatype for a matrix in row-major ordering
C     (100 copies of the row datatype, strided 1 word apart;
C     the successive row datatypes are interleaved)
      CALL MPI_TYPE_HVECTOR(100, 1, sizeofreal, row, matrow, ierr)
      CALL MPI_TYPE_COMMIT(matrow, ierr)

C     send matrix in row-major ordering and receive it in column-major order
      CALL MPI_SENDRECV(a, 1, matrow, myrank, 0,
     &                  b, 100*100, MPI_REAL, myrank, 0,
     &                  MPI_COMM_WORLD, status, ierr)

User-defined datatypes

Example: Transpose a matrix (cont.)

[Figure: illustration of the send/receive pattern.]

NOTE!! The whole array is LOCAL to the process (processes).

User-defined datatypes

MPI_TYPE_STRUCT(count, array_of_blklen, array_of_displ, array_of_types, newtype)

[Figure: count = 3, blklen = (2,3,4), displ = (0,7,16), with a different oldtype for each block.]

User-defined datatypes

MPI_TYPE_STRUCT allows one to describe a collection of data items of various elementary and derived types as a single data structure. Data is viewed as a set of blocks, each of which has its own count and datatype, and a location, given as a displacement.
OBS! The displacements need not be relative to the beginning of a particular structure. They can be given as absolute addresses as well. In this case they are treated as relative to the starting address in memory, given as MPI_BOTTOM:
  MPI_Bcast(MPI_BOTTOM, 1, struct_type, 0, comm)
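A hedged C sketch (not from the slides) of the MPI_BOTTOM idea: the displacements stored in the datatype are absolute addresses, so the buffer argument of the communication call is MPI_BOTTOM. The variables nsteps and tol are made up for the illustration.

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int    nsteps = 0;                      /* two unrelated variables, anywhere in memory */
      double tol    = 0.0;
      int          blocklens[2] = {1, 1};
      MPI_Datatype types[2]     = {MPI_INT, MPI_DOUBLE};
      MPI_Aint     displs[2];
      MPI_Datatype params;
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* absolute addresses, not offsets from a structure */
      MPI_Get_address(&nsteps, &displs[0]);
      MPI_Get_address(&tol,    &displs[1]);

      MPI_Type_create_struct(2, blocklens, displs, types, &params);
      MPI_Type_commit(&params);

      if (rank == 0) { nsteps = 100; tol = 1.0e-6; }

      /* the "buffer" is MPI_BOTTOM: the datatype itself says where the data lives */
      MPI_Bcast(MPI_BOTTOM, 1, params, 0, MPI_COMM_WORLD);

      MPI_Type_free(&params);
      MPI_Finalize();
      return 0;
  }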

User-defined datatypes - struct

Example: Another approach to the transpose problem:

      REAL a(100,100), b(100,100)
      INTEGER disp(2), blocklen(2), type(2), row, row1, sizeofreal
      INTEGER myrank, ierr
      INTEGER status(MPI_STATUS_SIZE)

      CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

C     transpose matrix a onto b

      CALL MPI_TYPE_EXTENT(MPI_REAL, sizeofreal, ierr)

C     create datatype for one row
      CALL MPI_TYPE_VECTOR(100, 1, 100, MPI_REAL, row, ierr)

User-defined datatypes - struct

Example: Another approach to the transpose problem (cont.):

C     create datatype for one row, with the extent of one real number
      disp(1) = 0
      disp(2) = sizeofreal
      type(1) = row
      type(2) = MPI_UB
      blocklen(1) = 1
      blocklen(2) = 1
      CALL MPI_TYPE_STRUCT(2, blocklen, disp, type, row1, ierr)
      CALL MPI_TYPE_COMMIT(row1, ierr)

C     send 100 rows and receive in column-major order
      CALL MPI_SENDRECV(a, 100, row1, myrank, 0,
     &                  b, 100*100, MPI_REAL, myrank, 0,
     &                  MPI_COMM_WORLD, status, ierr)

User-defined datatypes - struct

  /* set up 4 blocks */
  int          blockcounts[4] = {50, 1, 4, 2};
  MPI_Datatype types[4];
  MPI_Aint     displs[4];

  /* initialize types and displs with addresses of items */
  MPI_Address(&cmd.display, &displs[0]);
  MPI_Address(&cmd.max,     &displs[1]);
  MPI_Address(&cmd.min,     &displs[2]);
  MPI_Address(&cmd.error,   &displs[3]);
  types[0] = MPI_CHAR;
  types[1] = MPI_INT;
  types[2] = MPI_INT;
  types[3] = MPI_DOUBLE;
  for (i = 3; i >= 0; i--)
      displs[i] -= displs[0];

  MPI_Type_struct(4, blockcounts, displs, types, &strtype);
  MPI_Type_commit(&strtype);
  MPI_Bcast(&cmd, 1, strtype, 0, MPI_COMM_WORLD);

Datatype Constructors - additional

MPI_TYPE_COMMIT(datatype)
MPI_TYPE_FREE(datatype)
MPI_TYPE_EXTENT(datatype, extent)
  Returns the extent of a datatype. Can be used for both primitive and derived datatypes.
MPI_TYPE_SIZE(datatype, size)
  Returns the total size (in bytes) of the entries in the type signature associated with the datatype, i.e., the total size of the data in a message that would be created with this datatype.
For {(double, 0), (char, 8)}: the extent is 16, the size is 9.
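A small sketch (not from the slides; it uses the MPI-2 names MPI_Type_create_struct and MPI_Type_get_extent) contrasting extent and size for a type with type map {(double,0),(char,8)}:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int          blocklens[2] = {1, 1};
      MPI_Aint     displs[2]    = {0, 8};
      MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_CHAR};
      MPI_Datatype t;
      MPI_Aint lb, extent;
      int size, rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      MPI_Type_create_struct(2, blocklens, displs, types, &t);
      MPI_Type_commit(&t);

      MPI_Type_get_extent(t, &lb, &extent);   /* typically 16: padding up to the alignment of double */
      MPI_Type_size(t, &size);                /* 8 + 1 = 9 bytes of actual data */
      if (rank == 0)
          printf("extent = %ld, size = %d\n", (long)extent, size);

      MPI_Type_free(&t);
      MPI_Finalize();
      return 0;
  }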

Datatype Constructors

      CALL MPI_Type_vector ( 1, 1, 1, MPI_DOUBLE_PRECISION,
     &                       MPI_REAL_8, ierr )
      if ( ierr .NE. MPI_SUCCESS ) then
         stop
      end if
      CALL MPI_Type_commit ( MPI_REAL_8, ierr )
      if ( ierr .NE. MPI_SUCCESS ) then
         stop
      endif
c     ------------------------------------------------------------------
      CALL MPI_Type_vector ( sgridy*sgridz, 1, sgridx,
     &                       MPI_REAL_8, type_fixed_x, ierr )
      CALL MPI_Type_commit ( type_fixed_x, ierr )

      CALL MPI_Type_vector ( sgridz, sgridx, sgridx*sgridy,
     &                       MPI_REAL_8, type_fixed_y, ierr )
      CALL MPI_Type_commit ( type_fixed_y, ierr )

Datatype Constructors

c --------- fetch from EAST: [xv(i,j,k) = x(i+distx,j,k)]
      if (NEWS27(1) .ne. 999) then
         call MPI_SENDRECV(xv(nanrx-1,1,1), 1, type_fixed_x, NEWS27(1), 1,
     &                     xv(nanrx,1,1),   1, type_fixed_x, NEWS27(1), 2,
     &                     MPI_COMM_WORLD, status, ierr)
      endif
c --------- fetch from NORTH: [xv(i,j,k) = x(i,j+disty,k)]
      if (NEWS27(3) .ne. 999) then
         call MPI_SENDRECV(xv(1,nanry-1,1), 1, type_fixed_y, NEWS27(3), 3,
     &                     xv(1,nanry,1),   1, type_fixed_y, NEWS27(3), 4,
     &                     MPI_COMM_WORLD, status, ierr)
      endif

Type matching

  if (my_rank == 0)
      MPI_Send(message, send_count, send_type, 1, 0, comm);
  else if (my_rank == 1)
      MPI_Recv(message, recv_count, recv_type, 0, 0, comm, &status);

Must send_type be identical to recv_type?
Type signature of the derived datatype: {t0, t1, ..., t_{n-1}}.
Fundamental MPI rule: the type signature of the sender and the receiver must be compatible. If the signatures are
  {ts0, ts1, ..., ts_{n-1}} and {tr0, tr1, ..., tr_{m-1}},
then n <= m and ts_i = tr_i for i = 0, 1, ..., n-1.

Type matching, cont.

Example: Array A(10,10) of float.
Task: receive a column of it into a row of another array of the same size.

  if (my_rank == 0)
      MPI_Send(&A[0][0], 1, col_type, 1, 0, comm);
  else if (my_rank == 1)
      MPI_Recv(&A[0][0], 10, MPI_FLOAT, 0, 0, comm, &stat);

Basic Linear Algebra Subroutines (BLAS)

The BLAS routines fall into three categories, depending on whether the operations involve vectors or matrices:
(i) vector or scalar operations - Level 1 BLAS;
(ii) matrix-vector operations - Level 2 BLAS;
(iii) matrix-matrix operations - Level 3 BLAS.

Basic Linear Algebra Subroutines (BLAS)

The BLAS 1 subroutines perform low-granularity operations on vectors that involve one or two vectors as input and return either a vector or a scalar as output. In other words, O(n) operations are applied to O(n) data, where n is the vector length.
Some of the BLAS 1 operations:
  y = a*x + y       vector update
  x = a*x           vector scaling
  y = x             vector copy
  dot = x^T y       dot product
  nrm2 = ||x||_2    vector norm

Basic Linear Algebra Subroutines (BLAS)

BLAS 2 routines perform operations of a higher granularity than BLAS Level 1 subprograms. These include matrix-vector operations, i.e., O(n^2) operations applied to O(n^2) data. The major operations:
  y = alpha*A*x + beta*y
  y = alpha*A^T*x + beta*y
  y = T*x                                  multiplication by a triangular matrix T
  A = alpha*x*y^T + A                      rank-one update
  H = alpha*x*y^H + conj(alpha)*y*x^H + H  rank-two update, H is Hermitian
  y = T^{-1}*x                             solution of a system with a triangular matrix
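A brief C sketch of Level-1 and Level-2 calls (an illustration only; it assumes a CBLAS interface such as Netlib CBLAS or OpenBLAS is available and linked):

  #include <stdio.h>
  #include <cblas.h>

  int main(void)
  {
      double x[3] = {1.0, 2.0, 3.0};
      double y[3] = {1.0, 1.0, 1.0};
      double A[9] = {1.0, 0.0, 0.0,           /* 3 x 3 identity, row-major */
                     0.0, 1.0, 0.0,
                     0.0, 0.0, 1.0};

      /* BLAS 1 (vector update): y := 2*x + y */
      cblas_daxpy(3, 2.0, x, 1, y, 1);
      printf("axpy: y = %g %g %g\n", y[0], y[1], y[2]);

      /* BLAS 2 (matrix-vector): y := 1.0*A*x + 0.0*y */
      cblas_dgemv(CblasRowMajor, CblasNoTrans, 3, 3,
                  1.0, A, 3, x, 1, 0.0, y, 1);
      printf("gemv: y = %g %g %g\n", y[0], y[1], y[2]);
      return 0;
  }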

Basic Linear Algebra Subroutines (BLAS)

BLAS 3 routines are aimed at matrix-matrix operations, i.e., O(n^3) operations applied to O(n^2) data.
Some of the level-3 routines:
  C = alpha*A*A^T + beta*C                 matrix rank-one update
  C = alpha*A*B^T + alpha*B*A^T + beta*C   matrix rank-two update
  B = alpha*T^{-1}*B                       solution of a system with a triangular matrix and many right-hand sides

Basic Linear Algebra Communication Subroutines (BLACS)

BLACS aim at ease of programming, ease of use and portability. BLACS serve a particular audience and operate on 2D rectangular matrices (scalars, vectors, square, triangular and trapezoidal matrices are particular cases).
Syntax: vXXYY2D
- v - the type of the objects (I, S, D, C, Z);
- XX - indicates the shape of the matrix (GE, TR);
- YY - the action (SD (send), RV (receive), BS (broadcast send), BR (broadcast receive)).
  vGESD2D(M, N, A, LDA, RDEST, CDEST)
  vGERV2D(M, N, A, LDA, RDEST, CDEST)
  vTRSD2D(UPLO, DIAG, M, N, A, LDA, RDEST, CDEST)

Basic Linear Algebra Communication Subroutines (BLACS)

Here
  A(M,N) is the matrix,
  LDA - leading matrix dimension,
  RDEST - row index of the destination process,
  CDEST - column index of the destination process.
Example of a broadcasting routine call:
  vGEBS2D('scope', 'topology', M, N, A, LDA)
where 'scope' can be 'column' or 'row', and 'topology' can be 'hypercube' or 'tree'.

Linear Algebra problems

Operations on matrices and vectors:
- dense matrix linear algebra
- sparse matrix linear algebra

Dense matrix linear algebra

A(n,n), B(n,n), C(n,n), v(n,1), w(n,1)

Data storage and related issues

Matrix-vector multiplication w = A*v:

      w = 0
      do i = 1,n
         do j = 1,n
            w(i) = w(i) + A(i,j)*v(j)
         enddo
      enddo
c     ------------------
      do j = 1,n
         do i = 1,n
            w(i) = w(i) + A(i,j)*v(j)
         enddo
      enddo

Note: w(:) = w(:) + v(j)*A(:,j) is a vector operation!
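An illustrative C sketch (not from the slides) of the same two loop orderings; in C the array is row-major, so the (i outer, j inner) order walks A with stride 1, while in Fortran (column-major) the cache-friendly choice is the opposite:

  #include <stddef.h>

  void matvec_rowwise(size_t n, const double *A, const double *v, double *w)
  {
      for (size_t i = 0; i < n; i++) {
          double s = 0.0;
          for (size_t j = 0; j < n; j++)      /* A[i][j] accessed with stride 1 */
              s += A[i*n + j] * v[j];
          w[i] = s;
      }
  }

  void matvec_colwise(size_t n, const double *A, const double *v, double *w)
  {
      for (size_t i = 0; i < n; i++) w[i] = 0.0;
      for (size_t j = 0; j < n; j++)          /* A[i][j] accessed with stride n */
          for (size_t i = 0; i < n; i++)
              w[i] += A[i*n + j] * v[j];
  }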

Dense matrices

[Figure: striped partitionings of a matrix over processors P0-P2: (a) block, (b) cyclic; checkerboard partitionings over processors P0-P8: (c) block, (d) cyclic.]

Dense matrices

[Figure: communications during matrix-vector multiplication for block checkerboard partitioning over processors P0-P8: (e) column-wise one-to-all broadcast, (f) row-wise accumulation.]

  [ A11 A12 A13 ] [ v1 ]   [ A11*v1 + A12*v2 + A13*v3 ]
  [ A21 A22 A23 ] [ v2 ] = [ A21*v1 + A22*v2 + A23*v3 ]
  [ A31 A32 A33 ] [ v3 ]   [ A31*v1 + A32*v2 + A33*v3 ]

Matrix-matrix multiplication

      C = 0
      do i = 1,n
         do j = 1,n
            do k = 1,n
               C(i,j) = C(i,j) + A(i,k)*B(k,j)
            enddo
         enddo
      enddo

Serial complexity: n^3.

Matrix-matrix multiplication

[Figure: block-partitioned matrices A (blocks A11..A33) and B (blocks B11..B33).]

Scenario 1:
- All-to-all on each row of A
- All-to-all on each column of B
- Local multiplications and additions
Both communication and memory demanding!

Matrix-matrix multiplication

Scenario 2: Cannon's algorithm
L.E. Cannon, A cellular computer to implement the Kalman Filter Algorithm, Ph.D. thesis, Montana State University, Bozeman, MT, 1969.

Let A, B, C be n x n and let the number of processors be p. The matrices A, B and C are partitioned in blocks A(ij), B(ij), C(ij). Whenever A(ik) and B(kj) happen to be in processor (i,j), they are multiplied and accumulated into C(ij).

Matrix-matrix multiplication

[Figure: the block layout of A and B on the processor grid, with processor Pij holding one block of each.]

Matrix-matrix multiplication

Cannon's algorithm:

  for i = 1 : n                         % row-wise
      assign A(ij) to processor P_{i, (j-i) mod n}
  end
  for j = 1 : n                         % column-wise
      assign B(ij) to processor P_{(i-j) mod n, j}
  end
  for k = 1 : n
      forall (i = 1 : n, j = 1 : n)
          C(ij) = C(ij) + A(ik)*B(kj)
          Left circular shift each block row of A
          Upward circular shift each block column of B
      end
  end
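A hedged MPI sketch (not from the slides) of the skew-and-shift structure of Cannon's algorithm on a q x q periodic process grid (p = q*q), with one nb x nb block of A, B and C per process; blocks are stored row-major, C is assumed zero-initialized by the caller, and the local multiply is the naive triple loop:

  #include <mpi.h>

  static void local_mm(int nb, const double *A, const double *B, double *C)
  {
      for (int i = 0; i < nb; i++)
          for (int k = 0; k < nb; k++)
              for (int j = 0; j < nb; j++)
                  C[i*nb + j] += A[i*nb + k] * B[k*nb + j];
  }

  void cannon(int nb, double *A, double *B, double *C, MPI_Comm comm, int q)
  {
      int dims[2] = {q, q}, periods[2] = {1, 1}, coords[2];
      int rank, left, right, up, down, src, dst;
      MPI_Comm grid;
      MPI_Status st;

      /* periodic 2-D process grid */
      MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
      MPI_Comm_rank(grid, &rank);
      MPI_Cart_coords(grid, rank, 2, coords);

      /* initial alignment: shift block row i of A left by i,
         block column j of B up by j */
      MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
      MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);
      MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
      MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, dst, 0, src, 0, grid, &st);

      /* neighbours for the single-step shifts */
      MPI_Cart_shift(grid, 1, -1, &right, &left);    /* A moves left  */
      MPI_Cart_shift(grid, 0, -1, &down,  &up);      /* B moves up    */

      for (int k = 0; k < q; k++) {
          local_mm(nb, A, B, C);
          MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, left, 1, right, 1, grid, &st);
          MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, up,   2, down,  2, grid, &st);
      }
      MPI_Comm_free(&grid);
  }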

Matrix-matrix multiplication

Assume that each processor holds blocks of size (n/√p) x (n/√p). The algorithm requires 4√p - 2 communication steps, during each of which an amount of n^2/p words is transferred. Thus, for the parallel time we obtain

  T_p = n^3/p + (4√p - 2) n^2/p,

compared to n^3 in the serial case. For the speedup and efficiency the figures are (replacing 4√p - 2 by 4√p)

  S = p / (1 + 4√p/n),    E = 1 / (1 + 4√p/n).        (2)

Matrix-matrix multiplication

Relation (2) shows that if p is such that n ≥ 36√p, for instance, the efficiency becomes 0.9 or above. However, this algorithm has the disadvantage that it does not take into account whether the data layout it requires is suitable for other matrix operations.

Sparse matrices

Sparse matrices

Shared memory
[Figure: gather of the column A(:,i) and of y(1:n), the update, and the scatter of y(1:n) back to memory.]

  y(:) = y(:) + x(i)*A(i,:)

Sparse matrices

Distributed memory
[Figure: grid-wise mapping of a discrete problem (grid nodes 1-36, subdomains Ω1, Ω2, Ω3) onto a distributed memory computer (processors P1, P2, P3).]

Sparse matrices

(Block-)Tridiagonal systems:

      | A11  A12                     |
  A = | A21  A22  A23                |
      |      ...  ...  ...           |
      |           A_{n,n-1}  A_{n,n} |

or A = tridiag(A_{i,i-1}, A_{i,i}, A_{i,i+1}).

[Figures: (i) subdomain division and ordering (subdomains Ω1-Ω4, interfaces Γ1, Γ2); (j) subdomain division of a network.]

Sparse matrices

Given a tridiagonal matrix A, the usual way is to factorize it as A = LU and then solve Ax = b as
  Lz = b  and  Ux = z.
Both the factorization (Gauss elimination) and the solution of triangular (in this case, lower- and upper-bidiagonal) systems are PURELY SEQUENTIAL by their nature!

Forward substitution: Lz = b, i.e.,
  z_1 = b_1,
  z_i = b_i - sum_{k=1}^{i-1} l_{i,k} z_k,   i = 2, 3, ..., n.

Backward substitution: Ux = z, i.e.,
  x_n = z_n,
  x_i = z_i - sum_{k=i+1}^{n} u_{i,k} x_k,   i = n-1, ..., 1.

[Illustration: a 4 x 4 lower-triangular system L z = b.]
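A short C sketch of the two recursions (assumptions: L is unit lower-bidiagonal with subdiagonal l[1..n-1], U is upper-bidiagonal with diagonal d[0..n-1] and superdiagonal u[0..n-2], as produced by LU factorization of a tridiagonal matrix). Every value depends on the previously computed one, which is exactly what makes the recursion sequential:

  /* Solve L(Ux) = b for the bidiagonal factors of a tridiagonal matrix. */
  void solve_tridiag_lu(int n, const double *l, const double *d,
                        const double *u, const double *b, double *x)
  {
      /* forward substitution  L z = b  (z is stored in x) */
      x[0] = b[0];
      for (int i = 1; i < n; i++)
          x[i] = b[i] - l[i] * x[i-1];       /* l[i] is the subdiagonal entry of row i */

      /* backward substitution  U x = z */
      x[n-1] /= d[n-1];
      for (int i = n-2; i >= 0; i--)
          x[i] = (x[i] - u[i] * x[i+1]) / d[i];
  }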

Sparse matrices

In order to gain some parallelism when solving linear recursions, some special techniques have been studied:
- Multifrontal solution methods
- Odd-even elimination methods
