Lecture 13 (part 2) Data Level Parallelism (1) - UC Davis

Transcription

Lecture 13 (part 2): Data Level Parallelism (1). EEC 171 Parallel Architectures. John Owens, UC Davis.

Credits: John Owens / UC Davis 2007–9. Thanks to many sources for slide material: Computer Organization and Design (Patterson & Hennessy) 2005, Computer Architecture (Hennessy & Patterson) 2007, Inside the Machine (Jon Stokes) 2007, Dan Connors / University of Colorado 2007, Kathy Yelick / UCB 2007, Wen-mei Hwu / David Kirk, University of Illinois 2007, David Patterson / UCB 2003–7, John Lazzaro / UCB 2006, Mary Jane Irwin / Penn State 2005, John Kubiatowicz / UCB 2002, Krste Asanovic / Arvind / MIT 2002, Morgan Kaufmann Publishers 1998.

DLP Outline
Today: SIMD instructions
Upcoming: vector machines, vector efficiency, massively parallel machines, GPUs, stream processors, algorithms and programming models

[Slide figure: a processor, with its control and datapath, connected to a network]

Cook analogy
We want to prepare food for several banquets, each of which requires many dinners. We have two positions we can fill:
The boss (control), who gets all the ingredients and tells the chef what to do
The chef (datapath), who does all the cooking

Cook analogy
ILP is analogous to: one ultra-talented boss with many hands, or one ultra-talented chef with many hands.
TLP is analogous to: building multiple kitchens, each with a boss and a chef, or putting more bosses and chefs in the same kitchen.

Cook analogy
DLP is analogous to: one boss and lots of cloned chefs.

Flynn’s Classification Scheme
SISD – single instruction, single data stream: uniprocessors
SIMD – single instruction, multiple data streams: a single control unit broadcasting operations to multiple datapaths
MISD – multiple instruction, single data stream: no such machine (although some people put vector machines in this category)
MIMD – multiple instruction, multiple data streams: aka multiprocessors (SMPs, MPPs, clusters, NOWs)

Performance beyond single-thread ILP
There can be much higher natural parallelism in some applications (e.g., database or scientific codes): explicit thread-level parallelism or data-level parallelism.
Thread: a process with its own instructions and data. A thread may be a subpart of a parallel program ("thread"), or it may be an independent program ("process"). Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute.
Data-level parallelism: perform identical operations on data, and (possibly) lots of data.
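A minimal scalar sketch (not from the slides) of the kind of loop DLP targets: the same add is applied to every element, so a 4-wide SIMD machine could execute four iterations with a single instruction.

    #include <stddef.h>

    void add_arrays(float *c, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            c[i] = a[i] + b[i];    /* identical operation applied to every element */
    }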

Continuum of Granularity
"Coarse": each processor is more powerful; usually fewer processors; communication between processors is more expensive; processors are more loosely coupled; tend toward MIMD.
"Fine": each processor is less powerful; usually more processors; communication between processors is cheaper; processors are more tightly coupled; tend toward SIMD.

ILP vs. TLP
“SIMD is about exploiting parallelism in the data stream, while superscalar SISD is about exploiting parallelism in the instruction stream.”
— Jon “Hannibal” Stokes, “SIMD architectures,” Ars Technica (…/simd.ars)

What We Know

What’s New

Scalar vs. Vector
“The basic unit of SIMD love is the vector, which is why SIMD computing is also known as vector processing. A vector is nothing more than a row of individual numbers, or scalars.”
— Jon “Hannibal” Stokes, “SIMD architectures,” Ars Technica (…/simd.ars)

Representing Vectors
Multiple items within the same data word
Multiple data words
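A minimal sketch (illustrative only, not from the slides) of the two representations: four 16-bit items packed within one 64-bit data word, in the MMX/AltiVec style, versus the same items stored as multiple data words in an ordinary array.

    #include <stdint.h>

    /* "Multiple items within the same data word":
       one 64-bit word viewed as four 16-bit lanes. */
    typedef union {
        uint64_t word;       /* one 64-bit data word              */
        int16_t  lane[4];    /* ...viewed as four 16-bit elements */
    } packed16x4;

    /* "Multiple data words": the same four elements as an ordinary array. */
    static const int16_t separate_words[4] = { 1, 2, 3, 4 };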

Motorola AltiVec

Motorola AltiVec
Intra-element arithmetic and non-arithmetic functions:
integer arithmetic instructions
integer compare instructions
integer rotate and shift instructions
floating-point arithmetic instructions
floating-point rounding and conversion instructions
floating-point compare instruction
floating-point estimate instructions
memory access instructions

Motorola AltiVec
Inter-element arithmetic: between elements in a vector. Example: sum elements across a vector, store in an accumulator.
Alignment support instructions; permutation and formatting instructions: pack, unpack, merge, splat, permute, and shift left/right instructions.

Motorola AltiVec: inter-element non-arithmetic (vector permute)

AltiVec Architecture
32 new 128b-wide architectural registers: 16 8b elements, 8 16b elements, or 4 32b (FP/int) elements
2 fully pipelined, parallel AltiVec units: Vector Permute Unit (pack, unpack, permute, load/store) and Vector ALU (all math)
Latency: 1 cycle for simple instructions, 3–4 for complex
No interrupts except on load and store
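A minimal sketch of the AltiVec C interface (not from the slides; assumes a PowerPC compiler with <altivec.h> and -maltivec). vec_add exercises the Vector ALU, vec_perm the Vector Permute Unit described above.

    #include <altivec.h>

    vector signed int add4i(vector signed int a, vector signed int b)
    {
        return vec_add(a, b);               /* four 32-bit integer adds (Vector ALU) */
    }

    vector unsigned char shuffle(vector unsigned char a,
                                 vector unsigned char b,
                                 vector unsigned char pattern)
    {
        return vec_perm(a, b, pattern);     /* byte-wise permute of a:b (Vector Permute Unit) */
    }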

MMX Instructions
57 new instructions
2 operands per instruction (like x86)
No single-cycle permutes

Intel MMX Data Types
The bytes in a packed bytes data type are numbered 0 through 7, with byte 0 contained in the least significant bits (bits 0 through 7) and byte 7 contained in the most significant bits (bits 56 through 63). The words in the packed words data type are numbered 0 through 3, with word 0 contained in bits 0 through 15 and word 3 contained in bits 48 through 63. The doublewords in a packed doublewords data type are numbered 0 and 1, with doubleword 0 contained in bits 0 through 31 and doubleword 1 contained in bits 32 through 63.
[Figure 8-2, MMX data types: packed bytes (8 x 8 bits), packed words (4 x 16 bits), packed doublewords (2 x 32 bits), quadword (64 bits)]

Intel MMX
From "Programming with the Intel MMX Technology," Table 8-2, MMX instruction set (wraparound arithmetic forms shown; saturating variants also exist):
Arithmetic: Addition (PADDB, PADDW, PADDD); Subtraction (PSUBB, PSUBW, PSUBD); Multiplication (PMULL, PMULH); Multiply and Add (PMADD)
Comparison: Compare for Equal (PCMPEQB, PCMPEQW, PCMPEQD); Compare for Greater Than (PCMPGTB, PCMPGTW, PCMPGTD)
Conversion: Pack; Unpack
Logical: And (PAND); And Not (PANDN); Or (POR); Exclusive OR (PXOR)
Shift: Shift Left Logical (PSLLW, PSLLD); Shift Right Logical (PSRLW, PSRLD); Shift Right Arithmetic (PSRAW, PSRAD)
Data Transfer: Register to Register, Load from Memory, Store to Memory; Doubleword Transfers (MOVD), Quadword Transfers (MOVQ)
Empty MMX State: EMMS

MMX Details
8 MMX registers (64b each), aliased with the FP registers
"EMMS" (exit MMX) switches between them. Why is this a good idea? Why is this a bad idea?
Supports saturating arithmetic
Programmed with intrinsics
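A minimal sketch of the intrinsics style mentioned above (assumes x86-64 with <mmintrin.h>; the helper name is mine). It shows a saturating unsigned byte add and the EMMS transition back to x87 FP state.

    #include <mmintrin.h>
    #include <stdint.h>

    uint64_t saturating_add8(uint64_t a, uint64_t b)
    {
        __m64 va = _mm_cvtsi64_m64((long long)a);   /* pack 8 bytes into an MMX register */
        __m64 vb = _mm_cvtsi64_m64((long long)b);
        __m64 vr = _mm_adds_pu8(va, vb);            /* 8 unsigned saturating byte adds:
                                                       200 + 100 clamps to 255, no wraparound */
        uint64_t r = (uint64_t)_mm_cvtm64_si64(vr);
        _mm_empty();                                /* EMMS: release the aliased FP registers */
        return r;
    }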

MMX Example
To give you a better understanding of what can be done with MMX, here is a small function that blends two 32-bit ARGB pixels using 4 8-bit factors, one for each channel. To do this in C you would have to do the blending channel by channel, but with MMX we can blend all channels at once. The blending factor is a one-byte value between 0 and 255, as are the channel components. Each channel is blended using the formula res = (a*fa + b*fb)/256, where fb = 256 - fa.

    ; DWORD LerpARGB(DWORD a, DWORD b, DWORD f);
    global LerpARGB

    LerpARGB:
        ; load the pixels and expand to 4 words
        movd      mm1, [esp+4]      ; mm1 = 0 0 0 0 aA aR aG aB
        movd      mm2, [esp+8]      ; mm2 = 0 0 0 0 bA bR bG bB
        pxor      mm5, mm5          ; mm5 = 0
        punpcklbw mm1, mm5          ; mm1 = 0 aA 0 aR 0 aG 0 aB
        punpcklbw mm2, mm5          ; mm2 = 0 bA 0 bR 0 bG 0 bB

        ; load the factor and increase range to [0-256]
        movd      mm3, [esp+12]     ; mm3 = 0 0 0 0 faA faR faG faB   [0-255]
        punpcklbw mm3, mm5          ; mm3 = 0 faA 0 faR 0 faG 0 faB
        movq      mm6, mm3          ; mm6 = faA faR faG faB
        psrlw     mm6, 7            ; mm6 = fa >> 7 (0 or 1 per channel)   [0-1]
        paddw     mm3, mm6          ; mm3 = faA faR faG faB   [0-256]

        ; fb = 256 - fa
        pcmpeqw   mm4, mm4          ; mm4 = 0xFFFF 0xFFFF 0xFFFF 0xFFFF
        psrlw     mm4, 15           ; mm4 = 1 1 1 1
        psllw     mm4, 8            ; mm4 = 256 256 256 256
        psubw     mm4, mm3          ; mm4 = fbA fbR fbG fbB

        ; res = (a*fa + b*fb)/256
        pmullw    mm1, mm3          ; mm1 = a*fa
        pmullw    mm2, mm4          ; mm2 = b*fb
        paddw     mm1, mm2          ; mm1 = a*fa + b*fb
        psrlw     mm1, 8            ; mm1 = 0 rA 0 rR 0 rG 0 rB

        ; pack into eax
        packuswb  mm1, mm1          ; mm1 = 0 0 0 0 rA rR rG rB
        movd      eax, mm1          ; eax = rA rR rG rB
        ret
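For comparison, a minimal scalar C sketch of the same blend (my reconstruction, not from the slides; the function name is illustrative). It applies res = (a*fa + b*(256-fa))/256 channel by channel, which is exactly what the MMX version does to all four channels at once.

    #include <stdint.h>

    uint32_t LerpARGB_scalar(uint32_t a, uint32_t b, uint32_t f)
    {
        uint32_t res = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t ca = (a >> shift) & 0xFF;    /* channel of pixel a        */
            uint32_t cb = (b >> shift) & 0xFF;    /* channel of pixel b        */
            uint32_t fa = (f >> shift) & 0xFF;    /* per-channel factor, 0-255 */
            fa += fa >> 7;                        /* widen range to 0-256      */
            uint32_t cr = (ca * fa + cb * (256 - fa)) >> 8;
            res |= cr << shift;
        }
        return res;
    }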

OpenEXR
16-bit FP format (1 sign, 10 mantissa, 5 exponent bits). Natively supported (paired "half") in NVIDIA GPUs.
History: ILM developed the OpenEXR format in response to the demand for higher color fidelity in the visual effects industry.
The following pictures show the same OpenEXR image, yet the amount of detail you can see when comparing the three is substantially different. R2-D2 stands out clearly, but hidden in the top right corner shadows is Darth Vader. Using the Viewer, you can adjust the amount of exposure on the fly to see more details emerge: at the original exposure setting of zero (0); adjusted 3 stops brighter, details emerge from the shadows; adjusted 5 stops brighter, even more details emerge from the shadows.
http://www.openexr.com/
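A minimal sketch (not from the slides or the OpenEXR library) decoding the 16-bit "half" layout described above: 1 sign bit, 5 exponent bits with a bias of 15, and 10 mantissa bits.

    #include <math.h>
    #include <stdint.h>

    float half_to_float(uint16_t h)
    {
        int   sign = (h >> 15) & 0x1;     /* 1 sign bit        */
        int   exp  = (h >> 10) & 0x1F;    /* 5 exponent bits   */
        int   mant =  h        & 0x3FF;   /* 10 mantissa bits  */
        float s    = sign ? -1.0f : 1.0f;

        if (exp == 0)                          /* zero or subnormal */
            return s * ldexpf((float)mant, -24);
        if (exp == 31)                         /* infinity or NaN */
            return mant ? NAN : s * INFINITY;
        return s * ldexpf(1.0f + mant / 1024.0f, exp - 15);
    }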

SSE (Streaming SIMD Extensions)
Intel, 4-wide floating point
8 128-bit registers added to the ISA
Pentium III: 2 fully pipelined, SIMD, single-precision FP units
In HW, (historically) broke 4-wide instructions into 2 2-wide instructions
Interrupts are a problem
MMX + SSE added 10% to the PIII die
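A minimal sketch of 4-wide single-precision SSE (assumes <xmmintrin.h>; the helper name is mine). One _mm_add_ps expresses four FP adds, which the original Pentium III issued internally as two 2-wide operations.

    #include <xmmintrin.h>

    void add4f(float *dst, const float *a, const float *b)
    {
        __m128 va = _mm_loadu_ps(a);                /* load 4 floats, no alignment required */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(dst, _mm_add_ps(va, vb));     /* 4 single-precision adds */
    }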

Multimedia Extensions
Very short vectors added to existing ISAs for micros
Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
Newer designs have 128-bit registers (AltiVec, SSE2)
Limited instruction set: no strided load/store or scatter/gather (see the sketch after this list); unit-stride loads must be aligned to a 64/128-bit boundary
Limited vector register length: no vector length control; requires superscalar dispatch to keep multiply/add/load units busy; loop unrolling to hide latencies increases register pressure
Trend towards fuller vector support in microprocessors
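A minimal sketch of the missing-gather limitation (assumes SSE and <xmmintrin.h>; names are mine): a strided load has to be assembled element by element with scalar code before any vector arithmetic can run.

    #include <xmmintrin.h>

    __m128 strided_load4(const float *p, int stride)
    {
        /* _mm_set_ps takes its arguments from the highest lane down to lane 0 */
        return _mm_set_ps(p[3 * stride], p[2 * stride],
                          p[1 * stride], p[0 * stride]);
    }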
