Implementing & Tuning Irregular Programs On GPUs GTC 2013

Transcription

Implementing and Tuning IrregularPrograms on GPUsMartin BurtscherRupesh NasreComputer ScienceComputational Engineering and SciencesTexas State University-San MarcosThe University of Texas at Austin

GPU Advantages over CPUs Comparing GTX 680 to Xeon E5-2687W Both released in March 2012 Peak performance and main-memory bandwidth 8x as many operations executed per second 4x as many bytes transferred per second Cost-, energy-, and area-efficiencyNvidia 29x as much performance per dollar 6x as much performance per watt 11x as much performance per areaImplementing and Tuning Irregular Programs on GPUs2Intel

Codes Suitable for GPU Acceleration Clearly, we should be using GPUs all the time So why aren’t we? GPUs can only accelerate some types of codes Need lots of parallelism, data reuse, and regularity (inmemory accesses and control flow) Mostly regular codes have been ported to GPUs E.g., matrix codes executing many ops/wordDense matrix operations (Level 2 and 3 BLAS) Stencil codes (PDE solvers) LLNLImplementing and Tuning Irregular Programs on GPUs3

Codes Suitable for GPU Acceleration Clearly, we should be using GPUs all the time So why aren’t we?Our goal is to also handleirregular codes well GPUs can only accelerate some types of codes Need lots of parallelism, data reuse, and regularity (inmemory accesses and control flow) Mostly regular codes have been ported to GPUs E.g., matrix codes executing many ops/wordDense matrix operations (Level 2 and 3 BLAS) Stencil codes (PDE solvers) LLNLImplementing and Tuning Irregular Programs on GPUs4

Regular Programs Typically operate on arrays and matrices Data is processed in fixed-iteration FOR loops Have statically predictable behavior Exhibit mostly strided memory access patterns Control flow is mainly determined by input size Data dependencies are static and not loop carried Examplefor (i 0; i size; i ) {c[i] a[i] b[i];}Implementing and Tuning Irregular Programs on GPUswikipedia5

Irregular Programs Are important and widely used Social network analysis, data clustering/partitioning,discrete-event simulation, operations research,meshing, SAT solving, n-body simulation, etc. Typically operate on dynamic data structures Graphs, trees, linked lists, priority queues, etc. Data is processed in variable-iteration WHILE loopstripodwikipediaImplementing and Tuning Irregular Programs on GPUs6

Irregular Programs (cont.) Have statically unpredictable behavior Exhibit pointer-chasing memory access patterns Control flow depends on input values and may change Data dependences have to be detected dynamically Examplewhile (pos ! end) {v workset[pos ];for (i 0; i count[v]; i ){n neighbor[index[v] i];if (process(v,n)) workset[end ] n;} }LANLImplementing and Tuning Irregular Programs on GPUs7

GPU Implementation Challenges Indirect and irregular memory accesses Little or no coalescing [low bandwidth] Memory-bound pointer chasing Little locality and computation [exposed latency] Dynamically changing irregular control flow Thread divergence [loss of parallelism] Input dependent and changing data parallelism Load imbalance [loss of parallelism]Naïve implementation results in poor performanceImplementing and Tuning Irregular Programs on GPUs8

Outline Introduction Irregular codes and basic optimizations Irregular optimizations Performance results General guidelines SummaryImplementing and Tuning Irregular Programs on GPUs9

Our Irregular Programs Mostly taken from LonestarGPU benchmark suite Set of currently seven irregular CUDA codeshttp://iss.ices.utexas.edu/?p Breadth-first searchBHBarnes-Hut n-body simulationDMRLines ofCodeNumber ofkernels27521,0699Delaunay mesh refinement9584MSTMinimum spanning tree4468PTAPoints-to analysis3,49440SPSurvey propagation4453SSSPSingle-source shortest paths6142Implementing and Tuning Irregular Programs on GPUs10

Primary Data Structures Road networks BFS, Max-flow, MST, and SSSPiop Triangulated meshes DMR Control-flow graphs PTAwikipedia Unbalanced octreescalpoly BH Bi-partite graphssciencedirect SPucdenverImplementing and Tuning Irregular Programs on GPUs11

Basic CUDA Code Optimizations Our codes include many conventional optimizations Use structure of arrays, keep data on GPU, improvememory layout, cache data in registers, re-computeinstead of re-load values, unroll loops, chunk work,employ persistent threads, and use compiler flags Conventional parallelization optimizations Use light-weight locking, atomics, and lock-free code;minimize locking, memory fences, and volatile accesses Conventional architectural optimizations Utilize shared memory, constant memory, streams, threadvoting, and rsqrtf; detect compute capability and numberof SMs; tune thread count, blocks per SM, launch bounds,and L1 cache/shared-memory configurationImplementing and Tuning Irregular Programs on GPUs12

Outline Introduction Irregular codes and basic optimizations Irregular optimizations Performance results General guidelines SummaryImplementing and Tuning Irregular Programs on GPUs13

Irregular CUDA Code ynchronizationPush versus pullBFS, PTA, SSSPCombined memory fencesBH-summ, BH-treeWait-free pre-passBH-summKernel unrollingBFS, SSSPWarp-centric executionBH-force, PTAComputation onto traversalBH-sort, BH-summAlgorithm choiceMax-flow, SSSPImplementation adaptationMST, SSSPAlgebraic propertiesBFS, SP, SSSPSorting workBH-force, DMR, PTADynamic configurationDMR, PTAKernel fusion and fissionDMR, MST, SPCommunication onto computationBH, PTAMemoryAlgorithmSchedulingImplementing and Tuning Irregular Programs on GPUs14

Push versus Pull1SynchronizationPush versus pullBFS, PTA, SSSPCombined memory fencesBH-summ, BH-treeWait-free pre-passBH-summ Information may flow in a push- or pull-based way Push-based data-driven code is work-efficient Pull-based code minimizes synchronization In BFS and SSSP, we use a push-based model In PTA, we use a pull-based model Similar to scatter/gatherabImplementing and Tuning Irregular Programs on GPUspush-basedpull-based15

Combined Memory Fences1SynchronizationPush versus pullBFS, PTA, SSSPCombined memory fencesBH-summ, BH-treeWait-free pre-passBH-summ Memory fence operations can be slow Combining can help even with some load imbalance In BH-tree, the block-threads create subtrees,wait at a syncthreads, and install the subtrees In BH-summ, the block-threads summarize childinfo, run a syncthreads, then update the parentsImplementing and Tuning Irregular Programs on GPUs16

Wait-free Pre-pass1SynchronizationPush versus pullBFS, PTA, SSSPCombined memory fencesBH-summ, BH-treeWait-free pre-passBH-summ Use pre-pass to process ready consumers first Instead of waiting, skip over unready consumers Follow up with a ‘waitful’ pass in the end BH uses multiple pre-passes in summarization Due to high fan-out of octree, most of work is done inpre-passes, thus minimizing the waiting in final passImplementing and Tuning Irregular Programs on GPUs17

Kernel Unrolling2MemoryKernel unrollingBFS, SSSPWarp-centric executionBH-force, PTAComputation onto traversalBH-sort, BH-summ Kernel unrolling Instead of operating on a single graph element, athread operates on a (connected) subgraph Improves locality and latency-hiding ability Helps propagate information faster through graph Useful for memory-bound kernels like BFS & SSSPImplementing and Tuning Irregular Programs on GPUs18

Warp-centric Execution2MemoryKernel unrollingBFS, SSSPWarp-centric executionBH-force, PTAComputation onto traversalBH-sort, BH-summ Forcing warps to stay synchronized in irreg code Eliminates divergence and enables coalescing BH traverses entire warp’s union of tree prefixes Divergence-free traversal & only need per-warp data PTA uses lists with 32 words per element Enables fast set union & fully coalesced data accessesImplementing and Tuning Irregular Programs on GPUs19

Computation onto Traversal2MemoryKernel unrollingBFS, SSSPWarp-centric executionBH-force, PTAComputation onto traversalBH-sort, BH-summ Combines multiple passes over data structure In BH-sort, top-down traversal sorts bodies andmoves non-null children to front of child-array In BH-summ, bottom-up traversal computescenter of gravity and counts number of bodies insubtreesImplementing and Tuning Irregular Programs on GPUs20

Algorithm Choice3AlgorithmicAlgorithm choiceMax-flow, SSSPImplementation adaptationMST, SSSPAlgebraic propertiesBFS, SP, SSSP Best algorithm choice is architecture dependent For Max-flow, push-relabel algorithm is better forCPUs whereas MPM works better on GPUs For SSSP, Bellman-Ford algorithm is preferentialon GPUs whereas Dijkstra’s algorithm yieldsbetter serial performance on CPUsImplementing and Tuning Irregular Programs on GPUs21

Implementation Adaptation3AlgorithmicAlgorithm choiceMax-flow, SSSPImplementation adaptationMST, SSSPAlgebraic propertiesBFS, SP, SSSP Implementation may have to be adjusted for GPU Array-based graph representations Topology- instead of data-driven implementation For MST, do not explicitly contract edges Keep graph static and record which nodes are merged For SSSP, use topology-driven implementation Avoided synchronization outweighs work inefficiencyImplementing and Tuning Irregular Programs on GPUs22

Algebraic Properties3AlgorithmicAlgorithm choiceMax-flow, SSSPImplementation adaptationMST, SSSPAlgebraic propertiesBFS, SP, SSSP Exploiting monotonicity property Enables atomic-free topology-driven implementation In BFS and SSSP, the node distances decreasemonotonically; in SP, the product of 342310237235Implementing and Tuning Irregular Programs on GPUsAtomic-free updateLost-update problem23Correction by topology-driven processing,exploiting monotonicity

Sorting Work4SchedulingSorting workBH-force, DMR, PTADynamic configurationDMR, PTAKernel fusion and fissionDMR, MST, SPCommunication onto computationBH, PTA Each warp-thread should do similar work Sorting minimizes divergence and load imbalance In BH, bodies are sorted by spatial location; inDMR, triangles are sorted by badness; in PTA,nodes are sorted by in-degrees01229 30.Implementing and Tuning Irregular Programs on GPUs3101229.30.31.24

Dynamic Configuration4SchedulingSorting workBH-force, DMR, PTADynamic configurationDMR, PTAKernel fusion and fissionDMR, MST, SPCommunication onto computationBH, PTA Computation should follow parallelism profile Fewer threads fewer conflicts & less sync overhead In DMR, the number of aborts due to conflictsreduces from 60% and 35% in the first twoiterations to 1% and 4.4% PTA is similarImplementing and Tuning Irregular Programs on GPUs25

Kernel Fusion and Fission4SchedulingSorting workBH-force, DMR, PTADynamic configurationDMR, PTAKernel fusion and fissionDMR, MST, SPCommunication onto computationBH, PTA Kernel fusion Fast data-sharing and reduced invocation overhead Kernel fission Enables individually tuned kernel configurations DMR uses fusion to transfer connectivity info MST uses fission for various computationsImplementing and Tuning Irregular Programs on GPUs26

Communication onto Computation4SchedulingSorting workBH-force, DMR, PTADynamic configurationDMR, PTAKernel fusion and fissionDMR, MST, SPCommunication onto computationBH, PTA Overlap CPU/GPU transfer with GPU computation In BH, the bodies’ positions can be sent to theCPU while the next time step starts running In PTA, points-to information is incrementallycopied to the CPU while the GPU is computingmore points-to informationImplementing and Tuning Irregular Programs on GPUs27

Outline Introduction Irregular codes and basic optimizations Irregular optimizations Performance results General guidelines SummaryImplementing and Tuning Irregular Programs on GPUs28

Experimental Methodology Quadro 6000 GPU, 1.45 GHz, 6 GB, 448 PEs nvcc compiler v4.2 with -arch sm 20 -O3 flags LonestarGPU program inputs BFS, Max-flow, MST, SSSP: road networks, RMAT graphs, and random graphsBH: random star clusters based on Plummer modelDMR: randomly generated 2D meshesPTA: constraint graphs from open-source C/C progsSP: randomly generated hard 3-SAT formulae Base case: ported optimized multi-core CPU code Best case: includes applicable irreg. optimizationsImplementing and Tuning Irregular Programs on GPUs29

Overall Speedup over Naïve vimtshark1.11.5SP1M4M5.55.5SSSPRMAT20USA297.4196.8 Irregular optimizations can be essential Performance improves by up to 2 orders of magnitudeImplementing and Tuning Irregular Programs on GPUs30

Push versus Pull1SynchronizationPush versus pullBFS, PTA, SSSPCombined memory fencesBH-summ, BH-treeWait-free pre-passBH-summ SSSP on various graphs Push (scatter) is better sinceit propagates info faster Benefit is higher for denserRMAT graph Both the push and pullcodes are atomic-freeImplementing and Tuning Irregular Programs on GPUs31

Combined Memory Fences1SynchronizationPush versus pullBFS, PTA, SSSPCombined memory fencesBH-summ, BH-treeWait-free pre-passBH-summ BH octree building on star clusters Combining multiple memory fences into a singlesyncthreads 37% to 44% fasterdespite barrierinduced waitingImplementing and Tuning Irregular Programs on GPUs32

Kernel Unrolling2MemoryKernel unrollingBFS, SSSPWarp-centric executionBH-force, PTAComputation onto traversalBH-sort, BH-summ SSSP on various graphs Increasing unroll factorimproves computationto-latency ratio Almost halves runningtime for road networksImplementing and Tuning Irregular Programs on GPUs33

Warp-centric Execution2MemoryKernel unrollingBFS, SSSPWarp-centric executionBH-force, PTAComputation onto traversalBH-sort, BH-summ BH force calculation onstar clusters Warp-based executionenables coalescing anditeration-stack sharing About 8x speedupImplementing and Tuning Irregular Programs on GPUs34

Implementation Adaptation3AlgorithmicAlgorithm choiceMax-flow, SSSPImplementation adaptationMST, SSSPAlgebraic propertiesBFS, SP, SSSP SSSP on various graphs Work inefficient topology-driven implementation Enables synchronizationfree implementation Outperforms data-drivenImplementing and Tuning Irregular Programs on GPUs35

Algebraic Properties3AlgorithmicAlgorithm choiceMax-flow, SSSPImplementation adaptationMST, SSSPAlgebraic propertiesBFS, SP, SSSP SP for hard 3-SAT inputs Exploits monotonicity toavoid atomics Performance benefitincreases for largerproblem sizesImplementing and Tuning Irregular Programs on GPUs36

Sorting Work4SchedulingSorting workBH-force, DMR, PTADynamic configurationDMR, PTAKernel fusion and fissionDMR, MST, SPCommunication onto computationBH, PTA BH force calculation andsorting on star clusters Sorting reduces size ofunion of tree prefixes thatwarp threads traverse 7.4x to 8.7x speedupImplementing and Tuning Irregular Programs on GPUs37

Dynamic Configuration4SchedulingSorting workBH-force, DMR, PTADynamic configurationDMR, PTAKernel fusion and fissionDMR, MST, SPCommunication onto computationBH, PTA DMR on random mesh Dynamic kernel configuration lowers abort ratiofrom 60% to 1%in first iteration Overall performanceimproves by 14%Implementing and Tuning Irregular Programs on GPUs38

GPU/CPU Performance ComparisonSurvey PropagationBarnes Hut100010000Multi-coreMulti-coreRunning time (s)Running time (s)GPU10010GPU10001001101248161Number of threads4816Number of threadsDelaunay Mesh RefinementSingle Source Shortest Path100100Multi-coreMulti-coreGPUGPURunning time (s)Running time (s)2101010.11124816Number of UsInputs:BH : and5M stars(Plummermodel),1 timeDMR : 10M triangles, 4.7M bad triangles124816Number of threadsSP : 1M literals, 4.2M clausesSSSP : 23M nodes, 57M edges39

Outline Introduction Irregular codes and basic optimizations Irregular optimizations Performance results General guidelines SummaryImplementing and Tuning Irregular Programs on GPUs40

CPU/GPU Implementation Differences Irregular CPU code Dynamically (increm.) allocated data structStructure-based datastructuresLogical lock-basedimplementationGlobal/local worklistsRecursive or iterativeimplementationImplementing and Tuning Irregular Programs on GPUs Irregular GPU code Statically (wholly) allocated data structMultiple-array-baseddata structuresLock-freeimplementationLocal or no worklistsIterativeimplementation41

Exploit Irregularity-friendly Features Massive multithreading Ideal for hiding latency ofirregular mem. accesses Wide parallelism Great for exploiting largeamounts of parallelism Shared memory Useful for local worklists Fast data sharing Fast thread startup Essential when launchingthousands of threadsImplementing and Tuning Irregular Programs on GPUs HW support for reductionand synchronization Makes otherwise costlyoperations very fast Lockstep execution Can share data withoutexplicit synchronization Allows to consolidateiteration stacks Coalesced accesses Access combining is usefulin irregular codes42

Traits of Efficient Irregular GPU Codes Mandatory Important Need vast amounts of Scheduling is independentdata parallelism Can do large chunks ofcomputation on GPUof previous activities Easy to sort activities bysimilarity (if needed) Very Important Beneficial Cautious implementation DS can be expressed Easy to express iteratively Has statically knownthrough fixed arrays Uses local (or implicit)worklists that can bestatically populatedrange of neighborhoods DS size (or bound) can bedetermined based oninputImplementing and Tuning Irregular Programs on GPUs43

Mandatory Traits Need vast amounts of data parallelism Small inputs are not worth running on the GPU Many ordered algorithms do not qualify Can do large chunks of computation on GPU Time-intensive parts of algorithm must run on GPU Avoids frequent CPU-GPU memory transfersImplementing and Tuning Irregular Programs on GPUs44

Very Important Traits Cautious implementation No rollback mechanism is needed to handle conflicts Data struct can be expressed through fixed arrays Avoids slow dynamic memory allocation Guarantees contiguous memory Uses local (or implicit) worklists that can bestatically populated Avoids serialization point Enables lock-free implementationImplementing and Tuning Irregular Programs on GPUs45

Important Traits Scheduling is independent of or can ignoreprevious activities Unordered: all nodes active (once or always) or manynodes active (and easy to determine whether active) Ordered: ordered by level, assigned to worklist in‘correct’ order, poll until ready Easy to approximately sort activities by similarity Reduces divergence and enables other optimizationsImplementing and Tuning Irregular Programs on GPUs46

Beneficial Traits Easy to express iteratively Recursion is not well supported Has statically known range of neighborhoods Enables certain performance optimizations Data structure size (or upper bound) can bedetermined based on input Allows one-time fixed allocation or over-allocationImplementing and Tuning Irregular Programs on GPUs47

Outline Introduction Irregular codes and basic optimizations Irregular optimizations Performance results General guidelines SummaryImplementing and Tuning Irregular Programs on GPUs48

Summary of Optimization hronizationPush versus pullBFS, PTA, SSSPCombined memory fencesBH-summ, BH-treeWait-free pre-passBH-summKernel unrollingBFS, SSSPWarp-centric executionBH-force,

nvcc compiler v4.2 with -arch sm_20 -O3 flags LonestarGPU program inputs BFS, Max-flow, MST, SSSP: road networks, RMAT graphs, and random graphs BH: random star clusters based on Plummer model DMR: randomly generated 2D meshes PTA: constraint graphs from open-source C/C progs SP: randomly generated hard 3-SAT formulae