Computer Architecture: A Quantitative Approach, Fifth Edition

Chapter 4: Data-Level Parallelism in Vector, SIMD, and GPU Architectures

Introduction
- SIMD architectures can exploit significant data-level parallelism for:
  - matrix-oriented scientific computing
  - media-oriented image and sound processors
- SIMD is more energy efficient than MIMD
  - Only needs to fetch one instruction per data operation
  - Makes SIMD attractive for personal mobile devices
- SIMD allows the programmer to continue to think sequentially

Introduction: SIMD Parallelism
- Three variations:
  - Vector architectures
  - SIMD extensions
  - Graphics Processor Units (GPUs)
- For x86 processors:
  - Expect two additional cores per chip per year
  - SIMD width to double every four years
  - Potential speedup from SIMD to be twice that from MIMD!

Vector Architectures
- Basic idea:
  - Read sets of data elements into "vector registers"
  - Operate on those registers
  - Disperse the results back into memory
- Registers are controlled by the compiler
  - Used to hide memory latency
  - Leverage memory bandwidth

Vector Architectures: VMIPS
- Example architecture: VMIPS
  - Loosely based on Cray-1
- Vector registers
  - Each register holds a 64-element, 64 bits/element vector
  - Register file has 16 read ports and 8 write ports
- Vector functional units
  - Fully pipelined
  - Data and control hazards are detected
- Vector load-store unit
  - Fully pipelined
  - One word per clock cycle after initial latency
- Scalar registers
  - 32 general-purpose registers
  - 32 floating-point registers

Vector Architectures: VMIPS Instructions
- ADDVV.D: add two vectors
- ADDVS.D: add vector to a scalar
- LV/SV: vector load and vector store from address
- Example: DAXPY

    L.D     F0,a        ; load scalar a
    LV      V1,Rx       ; load vector X
    MULVS.D V2,V1,F0    ; vector-scalar multiply
    LV      V3,Ry       ; load vector Y
    ADDVV.D V4,V2,V3    ; add
    SV      Ry,V4       ; store the result

- Requires 6 instructions vs. almost 600 for MIPS
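For reference, the loop being compiled above is DAXPY (double-precision a times X plus Y); a C rendering over the 64-element VMIPS vector length:

    /* DAXPY (Y = a*X + Y), the loop the VMIPS sequence above implements */
    void daxpy(double a, double X[64], double Y[64]) {
        for (int i = 0; i < 64; i = i + 1)
            Y[i] = a * X[i] + Y[i];
    }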

Vector Architectures: Vector Execution Time
- Execution time depends on three factors:
  - Length of operand vectors
  - Structural hazards
  - Data dependences
- VMIPS functional units consume one element per clock cycle
  - Execution time is approximately the vector length
- Convoy
  - Set of vector instructions that could potentially execute together

Vector Architectures: Chimes
- Sequences with read-after-write dependency hazards can be in the same convoy via chaining
- Chaining
  - Allows a vector operation to start as soon as the individual elements of its vector source operand become available
- Chime
  - Unit of time to execute one convoy
  - m convoys execute in m chimes
  - For a vector length of n, requires m x n clock cycles

Vector Architectures: Example

    LV      V1,Rx       ; load vector X
    MULVS.D V2,V1,F0    ; vector-scalar multiply
    LV      V3,Ry       ; load vector Y
    ADDVV.D V4,V2,V3    ; add two vectors
    SV      Ry,V4       ; store the sum

- Convoys:
  1. LV, MULVS.D
  2. LV, ADDVV.D
  3. SV
- 3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
- For 64-element vectors, requires 64 x 3 = 192 clock cycles

Vector Architectures: Challenges
- Start-up time
  - Latency of the vector functional unit
  - Assume the same as Cray-1:
    - Floating-point add: 6 clock cycles
    - Floating-point multiply: 7 clock cycles
    - Floating-point divide: 20 clock cycles
    - Vector load: 12 clock cycles
- Improvements:
  - > 1 element per clock cycle
  - Non-64-wide vectors
  - IF statements in vector code
  - Memory system optimizations to support vector processors
  - Multiple-dimensional matrices
  - Sparse matrices
  - Programming a vector computer

Vector Architectures: Multiple Lanes
- Element n of vector register A is "hardwired" to element n of vector register B
  - Allows for multiple hardware lanes

Vector Architectures: Vector Length Register
- Vector length not known at compile time?
- Use the Vector Length Register (VLR)
- Use strip mining for vectors over the maximum length (MVL):

    low = 0;
    VL = (n % MVL);                          /* find odd-size piece using modulo op % */
    for (j = 0; j <= (n/MVL); j = j+1) {     /* outer loop */
        for (i = low; i < (low+VL); i = i+1) /* runs for length VL */
            Y[i] = a * X[i] + Y[i];          /* main operation */
        low = low + VL;                      /* start of next vector */
        VL = MVL;                            /* reset the length to maximum vector length */
    }

Vector Architectures: Vector Mask Registers
- Consider:

    for (i = 0; i < 64; i = i+1)
        if (X[i] != 0)
            X[i] = X[i] - Y[i];

- Use a vector mask register to "disable" elements:

    LV      V1,Rx       ; load vector X into V1
    LV      V2,Ry       ; load vector Y
    L.D     F0,#0       ; load FP zero into F0
    SNEVS.D V1,F0       ; sets VM(i) to 1 if V1(i) != F0
    SUBVV.D V1,V1,V2    ; subtract under vector mask
    SV      Rx,V1       ; store the result in X

- GFLOPS rate decreases!

Vector Architectures: Memory Banks
- Memory system must be designed to support high bandwidth for vector loads and stores
- Spread accesses across multiple banks
  - Control bank addresses independently
  - Load or store non-sequential words
  - Support multiple vector processors sharing the same memory
- Example:
  - 32 processors, each generating 4 loads and 2 stores per cycle
  - Processor cycle time is 2.167 ns, SRAM cycle time is 15 ns
  - How many memory banks are needed?
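Worked answer, following the book's arithmetic: the processors generate at most 32 x (4 + 2) = 192 memory references per cycle, and each SRAM bank stays busy for 15 / 2.167 = 6.92, rounded up to 7, processor cycles per access; so at least 192 x 7 = 1344 banks are needed to sustain full bandwidth.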

Vector Architectures: Stride
- Consider:

    for (i = 0; i < 100; i = i+1)
        for (j = 0; j < 100; j = j+1) {
            A[i][j] = 0.0;
            for (k = 0; k < 100; k = k+1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }

- Must vectorize multiplication of rows of B with columns of D
- Use non-unit stride
- Bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  - #banks / LCM(stride, #banks) < bank busy time
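For intuition, a worked instance in the spirit of the book's example (numbers assumed: 8 memory banks, 6-cycle bank busy time, 12-cycle memory latency, 64-element vector): with stride 1, each access hits a different bank, so the load completes in 12 + 64 = 76 cycles, about 1.2 cycles per element; with stride 32, every access maps to the same bank (32 mod 8 = 0), so each access after the first stalls for the busy time, taking 12 + 1 + 6 x 63 = 391 cycles, about 6.1 cycles per element.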

Vector Architectures: Scatter-Gather
- Consider:

    for (i = 0; i < n; i = i+1)
        A[K[i]] = A[K[i]] + C[M[i]];

- Use index vectors:

    LV      Vk, Rk        ; load K
    LVI     Va, (Ra+Vk)   ; load A[K[]]
    LV      Vm, Rm        ; load M
    LVI     Vc, (Rc+Vm)   ; load C[M[]]
    ADDVV.D Va, Va, Vc    ; add them
    SVI     (Ra+Vk), Va   ; store A[K[]]

Vector Architectures: Programming Vector Architectures
- Compilers can provide feedback to programmers
- Programmers can provide hints to the compiler

SIMD Instruction Set Extensions for Multimedia: SIMD Extensions
- Media applications operate on data types narrower than the native word size
  - Example: disconnect carry chains to "partition" the adder
- Limitations, compared to vector instructions:
  - Number of data operands is encoded into the opcode
  - No sophisticated addressing modes (strided, scatter-gather)
  - No mask registers

SIMD Instruction Set Extensions for Multimedia: SIMD Implementations
- Intel MMX (1996)
  - Eight 8-bit integer ops or four 16-bit integer ops
- Streaming SIMD Extensions (SSE) (1999)
  - Eight 16-bit integer ops
  - Four 32-bit integer/fp ops or two 64-bit integer/fp ops
- Advanced Vector Extensions (2010)
  - Four 64-bit integer/fp ops
- Operands must be in consecutive and aligned memory locations

SIMD Instruction Set Extensions for Multimedia: Example SIMD Code
- DAXPY on MIPS with 4-wide (256-bit) SIMD extensions:

    L.D     F0,a          ; load scalar a
    MOV     F1,F0         ; copy a into F1 for SIMD MUL
    MOV     F2,F0         ; copy a into F2 for SIMD MUL
    MOV     F3,F0         ; copy a into F3 for SIMD MUL
    DADDIU  R4,Rx,#512    ; last address to load
Loop:
    L.4D    F4,0(Rx)      ; load X[i], X[i+1], X[i+2], X[i+3]
    MUL.4D  F4,F4,F0      ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
    L.4D    F8,0(Ry)      ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
    ADD.4D  F8,F8,F4      ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
    S.4D    F8,0(Ry)      ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
    DADDIU  Rx,Rx,#32     ; increment index to X
    DADDIU  Ry,Ry,#32     ; increment index to Y
    DSUBU   R20,R4,Rx     ; compute bound
    BNEZ    R20,Loop      ; check if done

SIMD Instruction Set Extensions for Multimedia: Roofline Performance Model
- Basic idea:
  - Plot peak floating-point throughput as a function of arithmetic intensity
  - Ties together floating-point performance and memory performance for a target machine
- Arithmetic intensity
  - Floating-point operations per byte read

SIMD Instruction Set Extensions for Multimedia: Examples
- Attainable GFLOPs/sec = Min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Perf.)
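A minimal C sketch of this formula (the bandwidth and peak-rate numbers below are illustrative, not taken from the slides):

    #include <stdio.h>

    /* Roofline: performance is capped by whichever ceiling is lower, the
       memory-bound one (BW x arithmetic intensity) or the compute-bound
       one (peak floating-point rate). */
    static double attainable_gflops(double peak_bw_gbs,      /* GB/s */
                                    double arith_intensity,  /* FLOPs/byte */
                                    double peak_fp_gflops) { /* GFLOPs/s */
        double mem_bound = peak_bw_gbs * arith_intensity;
        return mem_bound < peak_fp_gflops ? mem_bound : peak_fp_gflops;
    }

    int main(void) {
        /* Hypothetical machine: 16 GB/s memory BW, 64 GFLOPs/s peak */
        for (double ai = 0.25; ai <= 16.0; ai *= 2.0)
            printf("AI = %5.2f FLOPs/byte -> %6.2f GFLOPs/s\n",
                   ai, attainable_gflops(16.0, ai, 64.0));
        return 0;
    }

Sweeping the arithmetic intensity traces out the roofline: the output rises linearly along the memory-bound slope and flattens at the compute peak.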

Graphical Processing Units
- Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
- Basic idea:
  - Heterogeneous execution model
    - CPU is the host, GPU is the device
  - Develop a C-like programming language for the GPU
  - Unify all forms of GPU parallelism as the CUDA thread
  - Programming model is "Single Instruction Multiple Thread"

Graphical Processing Units: Threads and Blocks
- A thread is associated with each data element
- Threads are organized into blocks
- Blocks are organized into a grid
- GPU hardware handles thread management, not applications or the OS
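The textbook's running DAXPY example shows this hierarchy in CUDA; a sketch (the 256-thread block size here is illustrative):

    // Each CUDA thread handles one element; 256-thread blocks tile the
    // grid so that every element i is covered by exactly one thread.
    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
        if (i < n)                                      // guard the tail block
            y[i] = a * x[i] + y[i];
    }

    // Host-side launch: enough 256-thread blocks to cover n elements
    // daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);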

Graphical Processing Units: NVIDIA GPU Architecture
- Similarities to vector machines:
  - Works well with data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files
- Differences:
  - No scalar processor
  - Uses multithreading to hide memory latency
  - Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Graphical Processing Units: Example
- Multiply two vectors of length 8192
- Code that works over all elements is the grid
- Thread blocks break this down into manageable sizes
  - 512 threads per block
- SIMD instruction executes 32 elements at a time
- Thus grid size = 16 blocks (8192 elements / 512 threads per block)
- Block is analogous to a strip-mined vector loop with vector length of 32
- Block is assigned to a multithreaded SIMD processor by the thread block scheduler
- Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
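One step the slide leaves implicit: each 512-thread block itself consists of 512 / 32 = 16 threads of SIMD instructions, since each SIMD instruction covers 32 elements.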

Graphical Processing Units: Terminology
- Threads of SIMD instructions
  - Each has its own PC
  - Thread scheduler uses a scoreboard to dispatch
  - No data dependences between threads!
  - Keeps track of up to 48 threads of SIMD instructions
    - Hides memory latency
- Thread block scheduler schedules blocks to SIMD processors
- Within each SIMD processor:
  - 32 SIMD lanes
  - Wide and shallow compared to vector processors

Graphical Processing Units: Example
- NVIDIA GPU has 32,768 registers
  - Divided into lanes
  - Each SIMD thread is limited to 64 registers
  - SIMD thread has up to:
    - 64 vector registers of 32 32-bit elements
    - 32 vector registers of 32 64-bit elements
  - Fermi has 16 physical SIMD lanes, each containing 2048 registers
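A quick consistency check on these numbers: 16 physical lanes x 2048 registers per lane = 32,768 registers, matching the total above.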

Graphical Processing Units: NVIDIA Instruction Set Architecture
- The ISA is an abstraction of the hardware instruction set
  - "Parallel Thread Execution (PTX)"
  - Uses virtual registers
  - Translation to machine code is performed in software
- Example:

    shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
    add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Graphical Processing Units: Conditional Branching
- Like vector architectures, GPU branch hardware uses internal masks
- Also uses:
  - Branch synchronization stack
    - Entries consist of masks for each SIMD lane
    - i.e., which threads commit their results (all threads execute)
  - Instruction markers to manage when a branch diverges into multiple execution paths
    - Push on divergent branch
  - ...and when paths converge
    - Act as barriers
    - Pops stack
- Per-thread-lane 1-bit predicate register, specified by programmer

Graphical Processing Units: Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

    ld.global.f64   RD0, [X+R8]     ; RD0 = X[i]
    setp.neq.s32    P1, RD0, #0     ; P1 is predicate register 1
    @!P1, bra       ELSE1, *Push    ; Push old mask, set new mask bits
                                    ; if P1 false, go to ELSE1
    ld.global.f64   RD2, [Y+R8]     ; RD2 = Y[i]
    sub.f64         RD0, RD0, RD2   ; Difference in RD0
    st.global.f64   [X+R8], RD0     ; X[i] = RD0
    @P1, bra        ENDIF1, *Comp   ; complement mask bits
                                    ; if P1 true, go to ENDIF1
ELSE1:
    ld.global.f64   RD0, [Z+R8]     ; RD0 = Z[i]
    st.global.f64   [X+R8], RD0     ; X[i] = RD0
ENDIF1:
    <next instruction>, *Pop        ; pop to restore old mask

Graphical Processing Units: NVIDIA GPU Memory Structures
- Each SIMD lane has a private section of off-chip DRAM
  - "Private memory"
  - Contains stack frame, spilling registers, and private variables
- Each multithreaded SIMD processor also has local memory
  - Shared by SIMD lanes / threads within a block
- Memory shared by SIMD processors is GPU Memory
  - Host can read and write GPU memory
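In CUDA source code the per-block local memory appears as __shared__ storage; a sketch of the three levels (the kernel name and 256-thread block size are illustrative assumptions):

    // Per-thread registers (private), per-block __shared__ memory (local),
    // and global GPU memory; assumes the kernel is launched with 256-thread blocks.
    __global__ void block_sum(const float *in, float *out) {
        __shared__ float tile[256];         // local memory: shared within one block
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // register: private
        tile[threadIdx.x] = in[i];          // read from GPU (global) memory
        __syncthreads();                    // barrier across the block
        if (threadIdx.x == 0) {             // one thread sums the block's tile
            float s = 0.0f;
            for (int j = 0; j < blockDim.x; j++)
                s += tile[j];
            out[blockIdx.x] = s;            // write back to GPU memory
        }
    }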

Graphical Processing Units: Fermi Architecture Innovations
- Each SIMD processor has:
  - Two SIMD thread schedulers, two instruction dispatch units
  - 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  - Thus, two threads of SIMD instructions are scheduled every two clock cycles
- Fast double precision
- Caches for GPU memory
- 64-bit addressing and unified address space
- Error correcting codes
- Faster context switching
- Faster atomic instructions

Graphical Processing Units: Fermi Multithreaded SIMD Processor
[Figure: block diagram of a Fermi multithreaded SIMD processor]

Detecting and Enhancing Loop-Level Parallelism: Loop-Level Parallelism
- Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
  - Loop-carried dependence
- Example 1:

    for (i = 999; i >= 0; i = i-1)
        x[i] = x[i] + s;

- No loop-carried dependence

Detecting and Enhancing Loop-Level Parallelism: Loop-Level Parallelism
- Example 2:

    for (i = 0; i < 100; i = i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

- S1 and S2 use values computed by S1 in the previous iteration
- S2 uses the value computed by S1 in the same iteration

Detecting and Enhancing Loop-Level Parallelism: Loop-Level Parallelism
- Example 3:

    for (i = 0; i < 100; i = i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

- S1 uses the value computed by S2 in the previous iteration, but the dependence is not circular, so the loop is parallel
- Transform to:

    A[0] = A[0] + B[0];
    for (i = 0; i < 99; i = i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[100] = C[99] + D[99];

Detecting and Enhancing Loop-Level Parallelism: Loop-Level Parallelism
- Example 4:

    for (i = 0; i < 100; i = i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }

  (the dependence of D on A is within an iteration, not loop-carried, so the loop is parallel)
- Example 5:

    for (i = 1; i < 100; i = i+1) {
        Y[i] = Y[i-1] + Y[i];
    }

  (a loop-carried recurrence on Y)

Detecting and Enhancing Loop-Level Parallelism: Finding Dependences
- Assume indices are affine: a*i + b (i is the loop index)
- Assume:
  - Store to a*i + b, then
  - Load from c*i + d
  - i runs from m to n
- A dependence exists if:
  - There exist j, k such that m <= j <= n and m <= k <= n
  - Store to a*j + b, load from c*k + d, and a*j + b = c*k + d

Detecting and Enhancing Loop-Level Parallelism: Finding Dependences
- Generally cannot be determined at compile time
- Test for the absence of a dependence:
  - GCD test: if a dependence exists, GCD(c,a) must evenly divide (d-b)
- Example:

    for (i = 0; i < 100; i = i+1) {
        X[2*i+3] = X[2*i] * 5.0;
    }
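Worked answer for the example above: here a = 2, b = 3, c = 2, d = 0, so GCD(a,c) = 2 and d - b = -3; since 2 does not evenly divide -3, no dependence is possible.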

Detecting and Enhancing Loop-Level Parallelism: Finding Dependences
- Example 2:

    for (i = 0; i < 100; i = i+1) {
        Y[i] = X[i] / c;  /* S1 */
        X[i] = X[i] + c;  /* S2 */
        Z[i] = Y[i] + c;  /* S3 */
        Y[i] = c - Y[i];  /* S4 */
    }

- Watch for antidependences and output dependences:
  - True dependences from S1 to S3 and from S1 to S4 (Y[i])
  - Antidependence from S1 to S2 (X[i])
  - Antidependence from S3 to S4 (Y[i])
  - Output dependence from S1 to S4 (Y[i])
- All are within a single iteration, so renaming makes the loop parallel

Detecting and Enhancing Loop-Level Parallelism: Reductions
- Reduction operation:

    for (i = 9999; i >= 0; i = i-1)
        sum = sum + x[i] * y[i];

- Transform to:

    for (i = 9999; i >= 0; i = i-1)
        sum[i] = x[i] * y[i];
    for (i = 9999; i >= 0; i = i-1)
        finalsum = finalsum + sum[i];

- Do on p processors:

    for (i = 999; i >= 0; i = i-1)
        finalsum[p] = finalsum[p] + sum[i+1000*p];

- Note: assumes associativity!
