Lecture 7: Distributed Memory

Transcription

Lecture 7: Distributed memory
David Bindel
15 Feb 2010

Logistics

HW 1 due Wednesday:
- See wiki for notes on:
  - Bottom-up strategy and debugging
  - Matrix allocation issues
  - Using SSE and alignment comments
  - Timing and OS scheduling
  - Gnuplot (to go up shortly)
- Submit by emailing me a tar or zip file.
- For your writeup, text or PDF preferred!
- Please also submit a feedback form (see web page).

Next HW: a particle dynamics simulation.

Plan for this week

Last week: shared memory programming
- Shared memory HW issues (cache coherence)
- Threaded programming concepts (pthreads and OpenMP)
- A simple example (Monte Carlo)

This week: distributed memory programming
- Distributed memory HW issues (topologies, cost models)
- Message-passing programming concepts (and MPI)
- A simple example ("sharks and fish")

Basic questions

How much does a message cost?
- Latency: time to get between processors
- Bandwidth: data transferred per unit time
- How does contention affect communication?

This is a combined hardware-software question!
We want to understand just enough for reasonable modeling.

Thinking about interconnects

Several features characterize an interconnect:
- Topology: who do the wires connect?
- Routing: how do we get from A to B?
- Switching: circuits, store-and-forward?
- Flow control: how do we manage limited resources?

Thinking about interconnects

- Links are like streets
- Switches are like intersections
- Hops are like blocks traveled
- Routing algorithm is like a travel plan
- Stop lights are like flow control
- Short packets are like cars, long ones like buses?

At some point the analogy breaks down.

Bus topology

(Figure: processors P0-P3 and memory attached to a single shared bus.)

- One set of wires (the bus)
  - Only one processor allowed at any given time
  - Contention for the bus is an issue
- Example: basic Ethernet, some SMPs

Crossbar

(Figure: crossbar connecting inputs P0-P3 to outputs P0-P3.)

- Dedicated path from every input to every output
  - Takes O(p^2) switches and wires!
- Example: recent AMD/Intel multicore chips (older: front-side bus)

Bus vs. crossbar

- Crossbar: more hardware
- Bus: more contention (less capacity?)

Generally seek happy medium:
- Less contention than bus
- Less hardware than crossbar
- May give up one-hop routing

Network properties

Think about latency and bandwidth via two quantities:
- Diameter: max distance between nodes
- Bisection bandwidth: smallest bandwidth cut to bisect
  - Particularly important for all-to-all communication

Linear topology

- p-1 links
- Diameter p-1
- Bisection bandwidth 1

Ring topology

- p links
- Diameter p/2
- Bisection bandwidth 2

Mesh

- May be more than two dimensions
- Route along each dimension in turn (see the sketch below)
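
A minimal sketch of dimension-ordered routing on a 2D mesh, assuming each node is identified by (x, y) coordinates; the type and function names here are illustrative, not part of the lecture code:

/* One step of dimension-ordered (XY) routing on a 2D mesh:
 * correct the x coordinate first, then the y coordinate. */
typedef struct { int x, y; } coord_t;

coord_t next_hop(coord_t cur, coord_t dst)
{
    if (cur.x != dst.x)            /* still off in dimension x? */
        cur.x += (dst.x > cur.x) ? 1 : -1;
    else if (cur.y != dst.y)       /* then fix dimension y */
        cur.y += (dst.y > cur.y) ? 1 : -1;
    return cur;                    /* unchanged if already at dst */
}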

Torus

Torus : Mesh :: Ring : Linear

Hypercube

- Label processors with binary numbers
- Connect p1 to p2 if labels differ in one bit
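
For concreteness (a standard calculation, not spelled out on the slide): a d-dimensional hypercube has p = 2^d processors, and
- each node has d = log2(p) links (one per bit),
- the diameter is d, since each hop can fix one differing bit,
- the bisection bandwidth is p/2, since cutting along any single bit leaves p/2 links across the cut.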

Fat tree

- Processors at leaves
- Increase link bandwidth near root

Others

- Butterfly network
- Omega network
- Cayley graph

Current picture

- Old: latency proportional to hops
- New: roughly constant latency (?)
  - Wormhole routing (or cut-through) flattens latencies vs store-and-forward at the hardware level
  - Software stack dominates HW latency!
  - Latencies not the same between networks (in box vs across)
  - May also have store-and-forward at the library level
- Old: mapping algorithms to topologies
- New: avoid topology-specific optimization
  - Want code that runs on next year's machine, too!
  - Bundle topology awareness in vendor MPI libraries?
  - Sometimes specify a software topology

α-β model

Crudest model: t_comm = α + β M
- t_comm = communication time
- α = latency
- β = inverse bandwidth
- M = message size

Works pretty well for basic guidance!
Typically α ≫ β ≫ t_flop. More money on network, lower α.
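
A minimal sketch of using the model for quick estimates; the numbers in the comments are placeholder values, not measurements from this lecture:

/* alpha-beta model: estimated time for one point-to-point message.
 * alpha = latency (s), beta = inverse bandwidth (s/byte), m = bytes. */
double comm_time(double alpha, double beta, double m)
{
    return alpha + beta * m;
}

/* E.g. with placeholder alpha = 1e-6 s and beta = 1e-9 s/byte,
 * a 1 MB message costs about 1e-6 + 1e6 * 1e-9, i.e. roughly 1 ms. */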

LogP model

Like α-β, but includes CPU time on send/recv:
- Latency: the usual
- Overhead: CPU time to send/recv
- Gap: min time between send/recv
- P: number of processors

Assumes small messages (gap ∼ bw for fixed message size).
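
For reference (standard LogP accounting, not spelled out on the slide): a single small message takes about

  t_msg ≈ o_send + L + o_recv = L + 2o

end to end, and a processor can inject a new message at most once every g time units, so its sustained small-message rate is bounded by 1/g.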

Communication costs

Some basic goals:
- Prefer larger to smaller messages (avoid latency)
- Avoid communication when possible
  - Great speedup for Monte Carlo and other embarrassingly parallel codes!
- Overlap communication with computation
  - Models tell you how much computation is needed to mask communication costs.

Message passing programming

Basic operations:
- Pairwise messaging: send/receive
- Collective messaging: broadcast, scatter/gather
- Collective computation: sum, max, other parallel prefix ops
- Barriers (no need for locks!)
- Environmental inquiries (who am I? do I have mail?)

(Much of what follows is adapted from Bill Gropp's material.)

MPI

- Message Passing Interface
- An interface spec; many implementations
- Bindings to C, C++, Fortran

Hello world

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Communicators

- Processes form groups
- Messages sent in contexts
  - Separate communication for libraries (see the sketch below)
- Group + context = communicator
- Identify process by rank in group
- Default is MPI_COMM_WORLD
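
A minimal sketch of building a new communicator by splitting MPI_COMM_WORLD; the even/odd split is purely illustrative:

int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

/* Processes with the same color end up in the same new communicator;
 * the key argument orders ranks within it (here: keep world order). */
int color = world_rank % 2;
MPI_Comm newcomm;
MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &newcomm);

int newrank, newsize;
MPI_Comm_rank(newcomm, &newrank);
MPI_Comm_size(newcomm, &newsize);
/* ... use newcomm, e.g. for library-internal communication ... */
MPI_Comm_free(&newcomm);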

Sending and receiving

Need to specify:
- What's the data?
  - Different machines use different encodings (e.g. endianness); a "bag o' bytes" model is inadequate
- How do we identify processes?
- How does the receiver identify messages?
- What does it mean to "complete" a send/recv?

MPI datatypes

Message is (address, count, datatype). Allow:
- Basic types (MPI_INT, MPI_DOUBLE)
- Contiguous arrays
- Strided arrays (sketch below)
- Indexed arrays
- Arbitrary structures

Complex data types may hurt performance?
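
For example, a strided array such as one column of a row-major matrix can be described with MPI_Type_vector. A minimal sketch, where A, n, j, dest, and tag are placeholders (A holds an n-by-n matrix of doubles, stored flat in row-major order):

/* Column j: n blocks of 1 double, separated by a stride of n doubles. */
MPI_Datatype column;
MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
MPI_Type_commit(&column);

/* Send the column starting at its first element. */
MPI_Send(&A[j], 1, column, dest, tag, MPI_COMM_WORLD);

MPI_Type_free(&column);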

MPI tags

Use an integer tag to label messages
- Helps distinguish different message types
- Can screen out messages with the wrong tag
- MPI_ANY_TAG is a wildcard

MPI Send/Recv

Basic blocking point-to-point communication:

int MPI_Send(void* buf, int count,
             MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);

int MPI_Recv(void* buf, int count,
             MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status* status);

MPI send/recv semantics

- Send returns when data gets to the system
  - ... might not yet arrive at destination!
- Recv ignores messages that don't match source and tag
  - MPI_ANY_SOURCE and MPI_ANY_TAG are wildcards
- Recv status contains more info (tag, source, size)
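
A minimal usage sketch of the wildcards and the status object; the buffer size and datatype are illustrative:

int data[100];
MPI_Status status;
MPI_Recv(data, 100, MPI_INT,
         MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

/* Who sent it, with what tag, and how many ints actually arrived? */
int source = status.MPI_SOURCE;
int tag    = status.MPI_TAG;
int count;
MPI_Get_count(&status, MPI_INT, &count);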

Ping-pong pseudocode

Process 0:
  for i = 1:ntrials
    send b bytes to 1
    recv b bytes from 1
  end

Process 1:
  for i = 1:ntrials
    recv b bytes from 0
    send b bytes to 0
  end

Ping-pong MPI

void ping(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Send(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD, NULL);
    }
}

(Pong is similar.)
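
The slide leaves pong to the reader; a version consistent with the pseudocode above (recv first, then send) might look like:

void pong(char* buf, int n, int ntrials, int p)
{
    for (int i = 0; i < ntrials; ++i) {
        MPI_Recv(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD, NULL);
        MPI_Send(buf, n, MPI_CHAR, p, 0, MPI_COMM_WORLD);
    }
}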

Ping-pong MPI

for (int sz = 1; sz <= MAX_SZ; sz += 1000) {
    if (rank == 0) {
        clock_t t1, t2;
        t1 = clock();
        ping(buf, sz, NTRIALS, 1);
        t2 = clock();
        printf("%d %g\n", sz,
               (double) (t2-t1)/CLOCKS_PER_SEC);
    } else if (rank == 1) {
        pong(buf, sz, NTRIALS, 0);
    }
}
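
A side note, not on the slide: MPI also provides a wall-clock timer, MPI_Wtime, which measures elapsed time rather than CPU time and is often preferable for timing communication. The same measurement would look like:

double t1 = MPI_Wtime();    /* wall-clock seconds since some fixed point */
ping(buf, sz, NTRIALS, 1);
double t2 = MPI_Wtime();
printf("%d %g\n", sz, t2 - t1);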

Running the code

On my laptop (OpenMPI):

  mpicc -std=c99 pingpong.c -o pingpong.x
  mpirun -np 2 ./pingpong.x

Details vary, but this is pretty normal.

Approximate α-β parameters (2-core laptop)

(Figure: ping-pong time vs. message size b, with fitted parameters
α ≈ 1.46 × 10^-6 s, β ≈ 3.89 × 10^-10 s/byte.)
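
As a quick back-of-the-envelope check of what these fitted values imply (not on the slide): an 8000-byte message costs roughly

  t ≈ α + 8000 β ≈ 1.46e-6 + 3.11e-6 ≈ 4.6e-6 s,

and the latency and bandwidth terms balance at M = α/β ≈ 3750 bytes, so messages much smaller than that are latency-dominated.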

Where we are now

Can write a lot of MPI code with the six operations we've seen:
- MPI_Init
- MPI_Finalize
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv

... but there are sometimes better ways.
Next time: non-blocking and collective operations!
