Optimizing Parallel Reduction In CUDA - Nvidia

Transcription

Optimizing Parallel Reduction in CUDA
Mark Harris
NVIDIA Developer Technology

Parallel Reduction
- Common and important data-parallel primitive
- Easy to implement in CUDA, harder to get it right
- Serves as a great optimization example
- We'll walk step by step through 7 different versions
- Demonstrates several important optimization strategies

Parallel Reduction
- Tree-based approach used within each thread block
  (diagram: values 3 1 7 0 4 1 6 3 reduced pairwise to 4 7 5 9, then 11 14, then 25)
- Need to be able to use multiple thread blocks
  - To process very large arrays
  - To keep all multiprocessors on the GPU busy
- Each thread block reduces a portion of the array
- But how do we communicate partial results between thread blocks?
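The per-block tree pattern above can be sketched on the CPU. The following is a minimal host-side illustration, not the CUDA kernel itself; the function name `treeReduce` is invented here:

```cpp
#include <cstddef>
#include <vector>

// CPU sketch of the tree-based reduction one thread block performs:
// each step combines pairs of partial sums, halving the number of
// live values until one total remains. On the GPU, the n adds of a
// given step run in parallel, one per thread.
int treeReduce(std::vector<int> data) {
    std::size_t n = data.size();              // assumed a power of 2
    while (n > 1) {
        n /= 2;
        for (std::size_t i = 0; i < n; ++i)   // independent adds per step
            data[i] = data[2 * i] + data[2 * i + 1];
    }
    return data[0];
}
```

On the slide's example values 3 1 7 0 4 1 6 3, the levels are 4 7 5 9, then 11 14, then 25.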

Problem: Global Synchronization
- If we could synchronize across all thread blocks, we could easily reduce very large arrays, right?
  - Global sync after each block produces its result
  - Once all blocks reach sync, continue recursively
- But CUDA has no global synchronization. Why?
  - Expensive to build in hardware for GPUs with high processor count
  - Would force programmer to run fewer blocks (no more than # multiprocessors * # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency
- Solution: decompose into multiple kernels
  - Kernel launch serves as a global synchronization point
  - Kernel launch has negligible HW overhead, low SW overhead

Solution: Kernel Decomposition
- Avoid global sync by decomposing computation into multiple kernel invocations
  (diagram - Level 0: 8 blocks, each reducing its 8-element portion to one partial sum; Level 1: 1 block reducing the 8 partial sums to the final result)
- In the case of reductions, code for all levels is the same
- Recursive kernel invocation
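The level-by-level decomposition can be sketched on the CPU, treating each call of `reduceLevel` as one kernel launch (a hedged illustration; both function names are invented here):

```cpp
#include <cstddef>
#include <vector>

// One "kernel launch": every block of `blockSize` consecutive elements
// is reduced to a single partial sum. The implicit synchronization
// between launches replaces the missing global sync.
std::vector<int> reduceLevel(const std::vector<int>& in, std::size_t blockSize) {
    std::vector<int> out;
    for (std::size_t b = 0; b < in.size(); b += blockSize) {
        int sum = 0;
        for (std::size_t i = b; i < b + blockSize && i < in.size(); ++i)
            sum += in[i];
        out.push_back(sum);               // one partial result per block
    }
    return out;
}

// Recursive invocation: keep launching levels until one value remains.
int reduceAll(std::vector<int> data, std::size_t blockSize) {
    while (data.size() > 1)
        data = reduceLevel(data, blockSize);
    return data[0];
}
```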

What is Our Optimization Goal?
- We should strive to reach GPU peak performance
- Choose the right metric:
  - GFLOP/s: for compute-bound kernels
  - Bandwidth: for memory-bound kernels
- Reductions have very low arithmetic intensity
  - 1 flop per element loaded (bandwidth-optimal)
- Therefore we should strive for peak bandwidth
- Will use G80 GPU for this example
  - 384-bit memory interface, 900 MHz DDR
  - 384 * 1800 / 8 = 86.4 GB/s
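The bandwidth arithmetic used as the target can be written out explicitly. This is a small helper sketch (function names invented here), assuming 4-byte ints as in the deck's kernels:

```cpp
// Peak bandwidth from bus width and effective transfer rate.
// G80: 384-bit interface at 900 MHz DDR = 1800 MT/s effective,
// so 384 * 1800 / 8 bytes/us = 86.4 GB/s.
double peakBandwidthGBs(double busBits, double effectiveMTs) {
    return busBits * effectiveMTs / 8.0 / 1000.0;
}

// Achieved bandwidth of a reduction: each of `numInts` 4-byte
// elements is read once from global memory in `timeMs` milliseconds.
double achievedBandwidthGBs(double numInts, double timeMs) {
    return numInts * 4.0 / (timeMs * 1e-3) / 1e9;
}
```

For example, 2^22 ints in 8.054 ms (kernel 1 below) works out to about 2.08 GB/s, far from the 86.4 GB/s peak.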

Reduction #1: Interleaved Addressing

    __global__ void reduce0(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s=1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

Parallel Reduction: Interleaved Addressing
(diagram: shared-memory values 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 reduced in 4 steps with strides 1, 2, 4, 8; the final sum 41 ends up in element 0)

Reduction #1: Interleaved Addressing

    __global__ void reduce1(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s=1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0) {
                sdata[tid] += sdata[tid + s];
            }
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

Problem: highly divergent warps are very inefficient, and the % operator is very slow

Performance for 4M element reduction

    Kernel                              Time (2^22 ints)   Bandwidth
    Kernel 1: interleaved addressing
    with divergent branching            8.054 ms           2.083 GB/s

Note: Block Size = 128 threads for all tests

Reduction #2: Interleaved Addressing

Just replace divergent branch in inner loop:

    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

With strided index and non-divergent branch:

    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }
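Why the strided index removes divergence can be seen by listing which threads do work at a given stride. The following CPU sketch (function names invented here) enumerates the active thread IDs under each scheme:

```cpp
#include <vector>

// Interleaved version: at stride s, the working threads are
// 0, 2s, 4s, ... - scattered across warps, so every warp diverges.
std::vector<int> activeInterleaved(int blockDim, int s) {
    std::vector<int> active;
    for (int tid = 0; tid < blockDim; ++tid)
        if (tid % (2 * s) == 0) active.push_back(tid);
    return active;
}

// Strided-index version: the same amount of work is mapped onto
// threads 0, 1, 2, ... - packed into the lowest warps, so whole
// warps are either fully active or fully idle.
std::vector<int> activeStrided(int blockDim, int s) {
    std::vector<int> active;
    for (int tid = 0; tid < blockDim; ++tid)
        if (2 * s * tid < blockDim) active.push_back(tid);
    return active;
}
```

For a 64-thread block at s = 1, the interleaved scheme activates every other thread in both warps, while the strided scheme activates threads 0-31: one full warp, one idle warp.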

Parallel Reduction: Interleaved Addressing
(diagram: the same values reduced in 4 steps with strides 1, 2, 4, 8, but with each step's adds performed by a contiguous range of threads)

New Problem: Shared Memory Bank Conflicts

Performance for 4M element reduction

    Kernel                              Time (2^22 ints)   Bandwidth    Step Speedup   Cumulative Speedup
    Kernel 1: interleaved addressing
    with divergent branching            8.054 ms           2.083 GB/s
    Kernel 2: interleaved addressing
    with bank conflicts                 3.456 ms           4.854 GB/s   2.33x          2.33x

Parallel Reduction: Sequential Addressing
(diagram: values reduced in 4 steps with strides 8, 4, 2, 1; at each step, thread tid adds element tid + stride, so both active threads and the addresses they touch are contiguous)

Sequential addressing is conflict free

Reduction #3: Sequential Addressing

Just replace strided indexing in inner loop:

    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

With reversed loop and threadID-based indexing:

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }
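The reversed loop can be simulated sequentially on the CPU to check it computes the right sum. This is a host-side sketch of the same addressing pattern (the name `sequentialReduce` is invented here):

```cpp
#include <cstddef>
#include <vector>

// CPU simulation of the reversed-loop, sequential-addressing reduction:
// s halves each step, and "thread" tid adds element tid + s, so active
// threads and the shared-memory words they touch are both contiguous.
int sequentialReduce(std::vector<int> sdata) {
    for (std::size_t s = sdata.size() / 2; s > 0; s >>= 1)
        for (std::size_t tid = 0; tid < s; ++tid)   // models "if (tid < s)"
            sdata[tid] += sdata[tid + s];
    return sdata[0];
}
```

On the 16-element example from the diagrams (10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2) this yields 41, matching the tree result.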

Performance for 4M element reduction

    Kernel                              Time (2^22 ints)   Bandwidth    Step Speedup   Cumulative Speedup
    Kernel 1: interleaved addressing
    with divergent branching            8.054 ms           2.083 GB/s
    Kernel 2: interleaved addressing
    with bank conflicts                 3.456 ms           4.854 GB/s   2.33x          2.33x
    Kernel 3: sequential addressing     1.722 ms           9.741 GB/s   2.01x          4.68x

Idle Threads

Problem:

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

Half of the threads are idle on first loop iteration!
This is wasteful...

Reduction #4: First Add During Load

Halve the number of blocks, and replace single load:

    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

With two loads and first add of the reduction:

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();
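The new load pattern can be checked with a small host-side sketch (the name `firstAddLoad` is invented here) that computes what one simulated block writes to shared memory:

```cpp
#include <cstddef>
#include <vector>

// Sketch of "first add during load": with half as many blocks, each
// simulated thread sums two elements a blockDim apart before the
// in-block tree reduction starts, so no thread is idle on step one.
std::vector<int> firstAddLoad(const std::vector<int>& g_idata,
                              std::size_t blockIdx, std::size_t blockDim) {
    std::vector<int> sdata(blockDim);
    for (std::size_t tid = 0; tid < blockDim; ++tid) {
        std::size_t i = blockIdx * (blockDim * 2) + tid;  // note the *2
        sdata[tid] = g_idata[i] + g_idata[i + blockDim];
    }
    return sdata;
}
```

For input 0..15 with blockDim = 4, block 0 produces {0+4, 1+5, 2+6, 3+7} and block 1 produces {8+12, 9+13, 10+14, 11+15}: every element is consumed exactly once.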

Performance for 4M element reduction

    Kernel                              Time (2^22 ints)   Bandwidth    Step Speedup   Cumulative Speedup
    Kernel 1: interleaved addressing
    with divergent branching            8.054 ms           2.083 GB/s
    Kernel 2: interleaved addressing
    with bank conflicts                 3.456 ms           4.854 GB/s   2.33x          2.33x
    Kernel 3: sequential addressing     1.722 ms           9.741 GB/s   2.01x          4.68x
    Kernel 4: first add during
    global load                         0.965 ms           17.377 GB/s  1.78x          8.34x

Instruction Bottleneck
- At 17 GB/s, we're far from bandwidth bound
  - And we know reduction has low arithmetic intensity
- Therefore a likely bottleneck is instruction overhead
  - Ancillary instructions that are not loads, stores, or arithmetic for the core computation
  - In other words: address arithmetic and loop overhead
- Strategy: unroll loops

Unrolling the Last Warp
- As reduction proceeds, # "active" threads decreases
  - When s <= 32, we have only one warp left
- Instructions are SIMD synchronous within a warp
- That means when s <= 32:
  - We don't need __syncthreads()
  - We don't need "if (tid < s)" because it doesn't save any work
- Let's unroll the last 6 iterations of the inner loop

Reduction #5: Unroll the Last Warp

    __device__ void warpReduce(volatile int* sdata, int tid) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

IMPORTANT: For this to be correct, we must use the "volatile" keyword!

    // later...
    for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warpReduce(sdata, tid);

Note: This saves useless work in all warps, not just the last one!
Without unrolling, all warps execute every iteration of the for loop and if statement

Performance for 4M element reduction

    Kernel                              Time (2^22 ints)   Bandwidth    Step Speedup   Cumulative Speedup
    Kernel 1: interleaved addressing
    with divergent branching            8.054 ms           2.083 GB/s
    Kernel 2: interleaved addressing
    with bank conflicts                 3.456 ms           4.854 GB/s   2.33x          2.33x
    Kernel 3: sequential addressing     1.722 ms           9.741 GB/s   2.01x          4.68x
    Kernel 4: first add during
    global load                         0.965 ms           17.377 GB/s  1.78x          8.34x
    Kernel 5: unroll last warp          0.536 ms           31.289 GB/s  1.8x           15.01x

Complete Unrolling
- If we knew the number of iterations at compile time, we could completely unroll the reduction
  - Luckily, the block size is limited by the GPU to 512 threads
  - Also, we are sticking to power-of-2 block sizes
- So we can easily unroll for a fixed block size
  - But we need to be generic: how can we unroll for block sizes that we don't know at compile time?
- Templates to the rescue!
  - CUDA supports C++ template parameters on device and host functions

Unrolling with Templates

Specify block size as a function template parameter:

    template <unsigned int blockSize>
    __global__ void reduce5(int *g_idata, int *g_odata);

Reduction #6: Completely Unrolled

    template <unsigned int blockSize>
    __device__ void warpReduce(volatile int* sdata, int tid) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
    }

    if (blockSize >= 512) {
        if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) {
        if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) {
        if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

    if (tid < 32) warpReduce<blockSize>(sdata, tid);

Note: all comparisons against blockSize (shown in red on the slide) are evaluated at compile time.
Results in a very efficient inner loop!
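The compile-time elimination the slide relies on works the same in plain C++: because blockSize is a template parameter, each comparison is a constant that the compiler folds away per instantiation. A small standalone analogue (the function name `unrolledSteps` is invented here):

```cpp
// C++ analogue of the kernel's compile-time branching: each condition
// below is a compile-time constant in every instantiation, so the
// untaken branches generate no code at all.
template <unsigned int blockSize>
int unrolledSteps() {
    int steps = 0;
    if (blockSize >= 512) ++steps;   // compiled out when blockSize < 512
    if (blockSize >= 256) ++steps;
    if (blockSize >= 128) ++steps;
    return steps;                    // # of pre-warp reduction steps emitted
}
```

So reduce5<512> contains all three shared-memory steps, reduce5<128> contains one, and reduce5<64> jumps straight to the warp-level phase.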

Invoking Template Kernels

Don't we still need block size at compile time?
Nope, just a switch statement for 10 possible block sizes:

    switch (threads) {
        case 512:
            reduce5<512><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
        case 256:
            reduce5<256><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
        case 128:
            reduce5<128><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
        case 64:
            reduce5<64><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
        case 32:
            reduce5<32><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
        case 16:
            reduce5<16><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
        case 8:
            reduce5<8><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
        case 4:
            reduce5<4><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
        case 2:
            reduce5<2><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
        case 1:
            reduce5<1><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata); break;
    }

Performance for 4M element reduction

    Kernel                              Time (2^22 ints)   Bandwidth    Step Speedup   Cumulative Speedup
    Kernel 1: interleaved addressing
    with divergent branching            8.054 ms           2.083 GB/s
    Kernel 2: interleaved addressing
    with bank conflicts                 3.456 ms           4.854 GB/s   2.33x          2.33x
    Kernel 3: sequential addressing     1.722 ms           9.741 GB/s   2.01x          4.68x
    Kernel 4: first add during
    global load                         0.965 ms           17.377 GB/s  1.78x          8.34x
    Kernel 5: unroll last warp          0.536 ms           31.289 GB/s  1.8x           15.01x
    Kernel 6: completely unrolled       0.381 ms           43.996 GB/s  1.41x          21.16x

Parallel Reduction Complexity
- log(N) parallel steps; each step S does N/2^S independent ops
  - Step complexity is O(log N)
- For N = 2^D, performs sum over S in [1..D] of 2^(D-S) = N-1 operations
  - Work complexity is O(N): it is work-efficient
  - i.e. does not perform more operations than a sequential algorithm
- With P threads physically in parallel (P processors), time complexity is O(N/P + log N)
  - Compare to O(N) for sequential reduction
  - In a thread block, N = P, so O(log N)
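The work-complexity identity can be checked numerically. A one-function sketch (the name `treeOps` is invented here) sums the per-step operation counts:

```cpp
// Numeric check of the work-complexity claim: for N = 2^D, the tree
// reduction performs (over steps S = 1..D) a total of
//   sum of 2^(D-S)  =  2^(D-1) + ... + 2 + 1  =  N - 1  adds,
// the same count as a sequential reduction.
long long treeOps(int D) {
    long long ops = 0;
    for (int S = 1; S <= D; ++S)
        ops += 1LL << (D - S);   // N / 2^S independent ops at step S
    return ops;
}
```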

What About Cost?
- Cost of a parallel algorithm is (# processors) x (time complexity)
  - Allocate threads instead of processors: O(N) threads
  - Time complexity is O(log N), so cost is O(N log N): not cost efficient!
- Brent's theorem suggests O(N/log N) threads
  - Each thread does O(log N) sequential work
  - Then all O(N/log N) threads cooperate for O(log N) steps
  - Cost = O((N/log N) * log N) = O(N): cost efficient
- Sometimes called algorithm cascading
  - Can lead to significant speedups in practice
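The cost comparison above can be made concrete with a pair of formulas (a sketch under the slide's model; the function names and the exact constant in the cascaded variant are assumptions for illustration):

```cpp
#include <cmath>

// Naive allocation: O(N) threads running for O(log N) steps.
// Cost = N * log N, which grows faster than the sequential O(N).
double naiveCost(double n) {
    return n * std::log2(n);
}

// Brent-style cascading: O(N / log N) threads, each doing O(log N)
// sequential adds, then O(log N) cooperative tree steps.
// Cost = (N / log N) * (log N + log N) = 2N, i.e. O(N).
double cascadeCost(double n) {
    double threads = n / std::log2(n);
    return threads * 2.0 * std::log2(n);
}
```

At N = 2^10 the naive cost is about 10N while the cascaded cost stays near 2N, and the gap widens with N.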

Algorithm Cascading
- Combine sequential and parallel reduction
  - Each thread loads and sums multiple elements into shared memory
  - Tree-based reduction in shared memory
- Brent's theorem says each thread should sum O(log n) elements
  - i.e. 1024 or 2048 elements per block vs. 256
- In my experience, beneficial to push it even further
  - Possibly better latency hiding with more work per thread
  - More threads per block reduces levels in tree of recursive kernel invocations
  - High kernel launch overhead in last levels with few blocks
- On G80, best perf with 64-256 blocks of 128 threads
  - 1024-4096 elements per thread

Reduction #7: Multiple Adds / Thread

Replace load and add of two elements:

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

With a while loop to add as many as necessary:

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    while (i < n) {
        sdata[tid] += g_idata[i] + g_idata[i+blockSize];
        i += gridSize;
    }
    __syncthreads();

Reduction #7: Multiple Adds / Thread
(repeats the same code, with an annotation on the while loop)

Note: gridSize loop stride to maintain coalescing!
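The grid-stride while loop can be simulated exhaustively on the CPU to confirm every input element is consumed exactly once. A host-side sketch (the name `gridStrideSum` is invented here), assuming n is a multiple of 2*blockSize as in the deck's power-of-2 setup:

```cpp
#include <vector>

// CPU simulation of the grid-stride loop: iterate over every
// (blockIdx, tid) pair and run the kernel's while loop verbatim,
// accumulating all per-thread partial sums into one total.
long long gridStrideSum(const std::vector<int>& g_idata,
                        unsigned gridDim, unsigned blockSize) {
    long long total = 0;
    unsigned n = static_cast<unsigned>(g_idata.size());
    unsigned gridSize = blockSize * 2 * gridDim;
    for (unsigned blockIdx = 0; blockIdx < gridDim; ++blockIdx)
        for (unsigned tid = 0; tid < blockSize; ++tid) {
            unsigned i = blockIdx * (blockSize * 2) + tid;
            long long sdata = 0;
            while (i < n) {                   // same loop as the kernel
                sdata += g_idata[i] + g_idata[i + blockSize];
                i += gridSize;
            }
            total += sdata;
        }
    return total;
}
```

With 1024 ones, any block/thread configuration covering the array returns 1024: no element is skipped or double-counted, and consecutive tids always touch consecutive addresses, which is what keeps the loads coalesced.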

Performance for 4M element reduction

    Kernel                              Time (2^22 ints)   Bandwidth    Step Speedup   Cumulative Speedup
    Kernel 1: interleaved addressing
    with divergent branching            8.054 ms           2.083 GB/s
    Kernel 2: interleaved addressing
    with bank conflicts                 3.456 ms           4.854 GB/s   2.33x          2.33x
    Kernel 3: sequential addressing     1.722 ms           9.741 GB/s   2.01x          4.68x
    Kernel 4: first add during
    global load                         0.965 ms           17.377 GB/s  1.78x          8.34x
    Kernel 5: unroll last warp          0.536 ms           31.289 GB/s  1.8x           15.01x
    Kernel 6: completely unrolled       0.381 ms           43.996 GB/s  1.41x          21.16x
    Kernel 7: multiple elements
    per thread                          0.268 ms           62.671 GB/s  1.42x          30.04x

Kernel 7 on 32M elements: 73 GB/s!

Final Optimized Kernel

    template <unsigned int blockSize>
    __device__ void warpReduce(volatile int *sdata, unsigned int tid) {
        if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
        if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
        if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
        if (blockSize >= 8)  sdata[tid] += sdata[tid + 4];
        if (blockSize >= 4)  sdata[tid] += sdata[tid + 2];
        if (blockSize >= 2)  sdata[tid] += sdata[tid + 1];
    }

    template <unsigned int blockSize>
    __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
        extern __shared__ int sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*(blockSize*2) + tid;
        unsigned int gridSize = blockSize*2*gridDim.x;
        sdata[tid] = 0;

        while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
        __syncthreads();

        if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
        if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
        if (blockSize >= 128) { if (tid < 64)  { sdata[tid] += sdata[tid + 64];  } __syncthreads(); }

        if (tid < 32) warpReduce<blockSize>(sdata, tid);
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }

Performance Comparison
(chart: time in ms on a log scale vs. # elements, for all seven kernels:
 1: Interleaved Addressing: Divergent Branches
 2: Interleaved Addressing: Bank Conflicts
 3: Sequential Addressing
 4: First add during global load
 5: Unroll last warp
 6: Completely unroll
 7: Multiple elements per thread (max 64 blocks))

Types of Optimization

Interesting observation:
- Algorithmic optimizations
  - Changes to addressing, algorithm cascading
  - 11.84x speedup, combined!
- Code optimizations
  - Loop unrolling
  - 2.54x speedup, combined
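The split of the overall 30x into algorithmic and code speedups follows from multiplying the per-step speedups in the tables: steps 2, 3, 4, and 7 are addressing/cascading changes, while steps 5 and 6 are unrolling. A tiny check of that bookkeeping (the helper name `product` is invented here):

```cpp
// Multiply a list of per-step speedups into a combined factor.
// Used below to verify that the algorithmic steps (2.33, 2.01, 1.78,
// 1.42) multiply to ~11.84x and the code steps (1.8, 1.41) to ~2.54x,
// together accounting for the ~30x cumulative speedup.
double product(const double* v, int n) {
    double p = 1.0;
    for (int i = 0; i < n; ++i) p *= v[i];
    return p;
}
```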

Conclusion
- Understand CUDA performance characteristics
  - Memory coalescing
  - Divergent branching
  - Bank conflicts
  - Latency hiding
- Use peak performance metrics to guide optimization
- Understand parallel algorithm complexity theory
- Know how to identify type of bottleneck
  - e.g. memory, core computation, or instruction overhead
- Optimize your algorithm, then unroll loops
- Use template parameters to generate optimal code

Questions: mharris@nvidia.com
