AmgX GPU Solver Developments for OpenFOAM

Transcription

AMGX GPU SOLVER DEVELOPMENTS FOR OPENFOAM
Matt Martineau, Stan Posey, Filippo Spiga
14-Oct-2020

SUMMARY
- Extended the PETSc4FOAM library (from members of the HPC TC) to accelerate pressure solves with AmgX
- Early results of the AmgX solver library used to accelerate the OpenFOAM pressure solve on GPUs achieved 4x to 8x speedups
- A new library, FOAM2CSR, for low-overhead conversion between OpenFOAM LDU matrices and GPU-resident CSR matrices
- Multi-GPU/multi-node capability, with ongoing performance optimisation

Reference: PETSc4FOAM: A Library to plug-in PETSc into the OpenFOAM Framework

AMGX FOR OPENFOAM
Open-source sparse iterative solver library: https://github.com/NVIDIA/AMGX
- Fully GPU-accelerated and highly configurable
- Algebraic multigrid (AMG) preconditioning
- In this study: PCG with AMG preconditioning
- All results refer to the v2.1.x pre-release branch
- Significant (~2x) setup performance increases
- Improved support for new versions of CUDA (10, 11)

AmgX configuration for AMG-preconditioned PCG (a minimal usage sketch follows the listing):

    {
      "solver": {
        "preconditioner": {
          "solver": "AMG",
          "cycle": "V",
          "smoother": { "solver": "BLOCK_JACOBI" },
          "max_iters": 1,
          "max_levels": 25,
          "interpolator": "D2",
          "presweeps": 1,
          "postsweeps": 1
        },
        "solver": "PCG",
        "max_iters": 100,
        "convergence": "ABSOLUTE",
        "tolerance": 1e-04,
        "norm": "L1"
      }
    }
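A minimal, single-GPU sketch (not taken from the talk) of how a JSON configuration like the one above is consumed through the AmgX C API. The file name pcg_amg.json and the tiny 2x2 test system are assumptions for illustration; the AMGX_* calls are the library's public C interface.

    /* amgx_min.cpp: solve a tiny SPD CSR system with the AMG-preconditioned PCG config */
    #include <amgx_c.h>
    #include <cstdio>

    int main()
    {
        /* 2x2 SPD test system in CSR form: A = [[4,-1],[-1,4]], solve A x = b */
        int    row_ptrs[]    = {0, 2, 4};
        int    col_indices[] = {0, 1, 0, 1};
        double values[]      = {4.0, -1.0, -1.0, 4.0};
        double rhs[]         = {1.0, 1.0};
        double sol[]         = {0.0, 0.0};

        AMGX_initialize();

        AMGX_config_handle    cfg;
        AMGX_resources_handle rsrc;
        AMGX_matrix_handle    A;
        AMGX_vector_handle    b, x;
        AMGX_solver_handle    solver;

        /* Load the JSON configuration shown above (file name is an assumption). */
        AMGX_config_create_from_file(&cfg, "pcg_amg.json");
        AMGX_resources_create_simple(&rsrc, cfg);

        /* dDDI: device matrix, double matrix/vector values, 32-bit indices. */
        AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
        AMGX_vector_create(&b, rsrc, AMGX_mode_dDDI);
        AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
        AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

        /* Upload the CSR matrix and vectors, build the AMG hierarchy, then solve. */
        AMGX_matrix_upload_all(A, 2, 4, 1, 1, row_ptrs, col_indices, values, NULL);
        AMGX_vector_upload(b, 2, 1, rhs);
        AMGX_vector_upload(x, 2, 1, sol);
        AMGX_solver_setup(solver, A);
        AMGX_solver_solve(solver, b, x);
        AMGX_vector_download(x, sol);
        std::printf("x = [%f, %f]\n", sol[0], sol[1]);

        AMGX_solver_destroy(solver);
        AMGX_vector_destroy(x);
        AMGX_vector_destroy(b);
        AMGX_matrix_destroy(A);
        AMGX_resources_destroy(rsrc);
        AMGX_config_destroy(cfg);
        AMGX_finalize();
        return 0;
    }

The split between AMGX_solver_setup and AMGX_solver_solve is the same setup/solve distinction that the profiling and resetup discussion on the later slides refers to.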

BASIC INITIAL SOLUTION
First pass at GPU acceleration
- Extending the PETSc4FOAM infrastructure to call into AmgXWrapper to drive AmgX
- AmgXWrapper accepts PETSc matrices, so the call chain is OpenFOAM -> PETSc4FOAM -> AmgXWrapper -> AmgX (a sketch of this PETSc-based path follows below)
- Initial performance slower than the CPU: much of the performance-critical code is not resident on the GPU

[Figure: high-level call structure for the initial acceleration approach using AmgX]
https://github.com/barbagroup/AmgXWrapper
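As a concrete picture of this initial path, here is a minimal sketch (not from the talk) of driving AmgX through the stock, PETSc-based AmgXWrapper interface. The 1D Poisson-style test matrix, the pcg_amg.json file name and the header location are illustrative assumptions; the initialize/setA/solve/finalize sequence follows the interface documented in the barbagroup/AmgXWrapper repository.

    #include <petscmat.h>
    #include <petscvec.h>
    #include "AmgXSolver.hpp"   // from barbagroup/AmgXWrapper (path is an assumption)

    int main(int argc, char **argv)
    {
        PetscInitialize(&argc, &argv, nullptr, nullptr);

        const PetscInt n = 100;   // small 1D Poisson-style test system
        Mat A;
        Vec x, b;

        // Assemble a tridiagonal PETSc matrix as a stand-in for the pressure matrix.
        MatCreate(PETSC_COMM_WORLD, &A);
        MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
        MatSetFromOptions(A);
        MatSetUp(A);

        PetscInt start, end;
        MatGetOwnershipRange(A, &start, &end);
        for (PetscInt i = start; i < end; ++i) {
            if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
            if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
            MatSetValue(A, i, i, 2.0, INSERT_VALUES);
        }
        MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
        MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

        MatCreateVecs(A, &x, &b);
        VecSet(b, 1.0);
        VecSet(x, 0.0);

        // Drive AmgX through the wrapper: mode "dDDI", JSON config as on the AmgX slide.
        AmgXSolver solver;
        solver.initialize(PETSC_COMM_WORLD, "dDDI", "pcg_amg.json");
        solver.setA(A);       // wrapper converts/uploads the PETSc matrix to AmgX
        solver.solve(x, b);   // solve A x = b on the GPU
        solver.finalize();

        VecDestroy(&x); VecDestroy(&b); MatDestroy(&A);
        PetscFinalize();
        return 0;
    }

In the adapted solution described on the later slides, this PETSc Mat/Vec stage is bypassed in favour of raw CSR arrays produced by FOAM2CSR.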

INITIAL SOLUTION PROFILING
Searching for optimisation potential
- 1 MM cell case on DGX-1 using a V100 and a single BDW core
- Goal: accelerate buildMat and reduce the overhead of amgxWrapperSetA

    Task                 Time
    Build matrix         4.7s
    Get local matrix     0.4s
    Upload matrix        0.6s
    Setup                1.1s
    Wrap PETSc vector    0.1s
    Pressure solve       0.2s

- Out of the 7.2s pressure-solve time, only 1.4s is effective GPU work
- AmgX solver setup is required on the first step; in subsequent steps this can be replaced with AmgX solver resetup
- "Wrap PETSc vector" can be avoided

ADAPTED SOLUTION
Replacing the PETSc structures with GPU-resident CSR
- FOAM2CSR implemented to increase the amount of computational workload resident on the GPU
- AmgXWrapper is extended and optimised to support CSR and improve host utilisation

[Figure: revised call structure, with FOAM2CSR producing CSR structures in place of the PETSc matrix path]

FOAM2CSR APPROACH
OpenFOAM LDU to GPU-resident CSR

FOAM2CSR algorithm (a Thrust sketch of steps (2)-(4) follows below):
(1) Copy/reorganise the LDU matrix data ready for conversion:

    diagAddr   = [ 0, 1, ..., Nrows ]
    perm       = [ 0, 1, ..., Nnz ]
    colIndices = [ diagAddr, upperAddr, lowerAddr ]
    rowIndices = [ diagAddr, lowerAddr, upperAddr ]
    values     = [ diag,     upper,     lower     ]

(2) Sort perm and rowIndices, keyed by rowIndices (radix sort)
(3) Collapse rowIndices to rowOffsets (exclusive scan)
(4) Sort colIndices and values by perm

After the first step: a low-overhead conversion

[LDU matrix visualisation taken from S. Bnà, I. Spisso, M. Olesen, G. Rossi, "PETSc4FOAM: A Library to plug-in PETSc into the OpenFOAM Framework"]
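A minimal Thrust sketch of steps (2)-(4), assuming the concatenated rowIndices/colIndices/values arrays from step (1) are already resident on the GPU. Function and variable names are illustrative; this is not the FOAM2CSR source.

    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>
    #include <thrust/gather.h>
    #include <thrust/binary_search.h>
    #include <thrust/iterator/counting_iterator.h>

    void lduCooToCsr(thrust::device_vector<int>&    rowIndices,  // COO rows, size Nnz
                     thrust::device_vector<int>&    colIndices,  // COO cols, size Nnz
                     thrust::device_vector<double>& values,      // COO values, size Nnz
                     thrust::device_vector<int>&    rowOffsets,  // CSR offsets, size Nrows+1
                     int nRows)
    {
        const int nnz = static_cast<int>(rowIndices.size());

        // (2) Build perm = [0..Nnz) and sort it together with rowIndices, keyed by
        //     rowIndices (Thrust dispatches a GPU radix sort for integer keys).
        thrust::device_vector<int> perm(nnz);
        thrust::sequence(perm.begin(), perm.end());
        thrust::stable_sort_by_key(rowIndices.begin(), rowIndices.end(), perm.begin());

        // (3) Collapse the sorted row indices into CSR row offsets. Finding the first
        //     occurrence of each row id is equivalent to an exclusive scan of the
        //     per-row counts; rowOffsets[nRows] ends up equal to Nnz.
        thrust::counting_iterator<int> rowIds(0);
        thrust::lower_bound(rowIndices.begin(), rowIndices.end(),
                            rowIds, rowIds + nRows + 1, rowOffsets.begin());

        // (4) Apply the same permutation to the column indices and values.
        thrust::device_vector<int>    colSorted(nnz);
        thrust::device_vector<double> valSorted(nnz);
        thrust::gather(perm.begin(), perm.end(), colIndices.begin(), colSorted.begin());
        thrust::gather(perm.begin(), perm.end(), values.begin(),     valSorted.begin());
        colIndices.swap(colSorted);
        values.swap(valSorted);
    }

The sort, scan and gather steps all run as standard GPU primitives, which is what keeps the conversion low overhead once the step (1) layout is in place.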

AMGX / AMGXWRAPPER CHANGES
Improving integration and performance
- Added the OpenFOAM residual calculation to AmgX
- Fixed the default partitioning scheme in AmgX
- We extended AmgXWrapper to:
  - Handle raw CSR inputs, either host or device pointers
  - Support updating matrix coefficients only, and resetup, a fast setup for timesteps where the sparsity pattern persists (a sketch of the underlying AmgX calls follows below)
  - Perform matrix consolidation using CUDA IPC calls

    /* CSR matrix to AmgX */
    ErrorCode setA(
        int nGlobalRows,
        int nLocalRows,
        int nLocalNz,
        int* rowOffsets,
        int* colIndicesGlobal,
        double* values,
        int* partData);

    /* Update CSR matrix values in AmgX */
    ErrorCode updateA(
        const int nLocalRows,
        const int nLocalNz,
        const double* values);

    /* Performs the linear solve in AmgX */
    ErrorCode solve(
        double* solution,
        const double* rhs,
        const int nRows);

https://github.com/barbagroup/AmgXWrapper
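To ground the "update coefficients only, then resetup" path, here is a hedged sketch of the kind of AmgX C API calls it maps onto. The driver function, loop structure and variable names are assumptions; AMGX_matrix_replace_coefficients and AMGX_solver_resetup are the library calls intended for timesteps whose sparsity pattern is unchanged.

    #include <amgx_c.h>

    /* Pressure-solve step sketch: full setup on the first step, value-only update
     * plus a cheaper resetup afterwards. Handles are assumed to have been created
     * and configured as in the earlier AmgX C API sketch. */
    void solve_step(int step, int nRows, int nnz,
                    const int *rowOffsets, const int *colIndices,
                    const double *values, const double *rhs, double *x_host,
                    AMGX_matrix_handle A, AMGX_vector_handle b, AMGX_vector_handle x,
                    AMGX_solver_handle solver)
    {
        if (step == 0) {
            /* First step: upload the full CSR matrix and build the AMG hierarchy. */
            AMGX_matrix_upload_all(A, nRows, nnz, 1, 1,
                                   rowOffsets, colIndices, values, NULL);
            AMGX_solver_setup(solver, A);
        } else {
            /* Same sparsity pattern: replace coefficients only and resetup. */
            AMGX_matrix_replace_coefficients(A, nRows, nnz, values, NULL);
            AMGX_solver_resetup(solver, A);
        }

        AMGX_vector_upload(b, nRows, 1, rhs);
        AMGX_vector_upload(x, nRows, 1, x_host);   /* initial guess */
        AMGX_solver_solve(solver, b, x);
        AMGX_vector_download(x, x_host);
    }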

AMGXWRAPPER CONSOLIDATION
Merging matrix elements for performance
- Performance was limited by the single-core restriction, because matrix assembly, the momentum solves, etc. remain CPU-resident
- We developed a low-overhead consolidation feature in AmgXWrapper: the local matrices of the CPU ranks are merged onto the GPU-owning ranks using CUDA IPC (a simplified sketch follows below)
- CPU cores can then be saturated for improved simulation runtime, around 8x wallclock speedup on a single GPU

[Figure: ranks 0-3 assemble A0-A3 on the CPU; A0/A1 are consolidated onto rank 0 and A2/A3 onto rank 2 via CUDA IPC]
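A simplified sketch of node-local consolidation with CUDA IPC (illustrative, not the AmgXWrapper implementation): non-root ranks export handles to their device buffers, and the GPU-owning root rank opens those handles and copies the data device-to-device into one contiguous buffer. The function name, communicator layout and the counts/displacements arrays are assumptions.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // d_local must be the base pointer of a cudaMalloc allocation, and all ranks in
    // nodeComm must reside on the same node for the IPC handles to be openable.
    void consolidate(const double *d_local, int nLocal,     // this rank's device data
                     double *d_consolidated,                // root only: merged device buffer
                     const int *counts, const int *displs,  // root only: per-rank sizes/offsets
                     MPI_Comm nodeComm)
    {
        int rank, nRanks;
        MPI_Comm_rank(nodeComm, &rank);
        MPI_Comm_size(nodeComm, &nRanks);
        (void)nLocal;

        if (rank != 0) {
            // Export a handle to this rank's device buffer and send it to the root.
            cudaIpcMemHandle_t handle;
            cudaIpcGetMemHandle(&handle, (void *)d_local);
            MPI_Send(&handle, (int)sizeof(handle), MPI_BYTE, 0, 0, nodeComm);
        } else {
            // Root copies its own contribution ...
            cudaMemcpy(d_consolidated + displs[0], d_local,
                       counts[0] * sizeof(double), cudaMemcpyDeviceToDevice);
            // ... then maps every other rank's buffer and copies it into place.
            for (int r = 1; r < nRanks; ++r) {
                cudaIpcMemHandle_t handle;
                MPI_Recv(&handle, (int)sizeof(handle), MPI_BYTE, r, 0,
                         nodeComm, MPI_STATUS_IGNORE);

                void *d_remote = NULL;
                cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
                cudaMemcpy(d_consolidated + displs[r], d_remote,
                           counts[r] * sizeof(double), cudaMemcpyDeviceToDevice);
                cudaIpcCloseMemHandle(d_remote);
            }
        }
        MPI_Barrier(nodeComm);   // keep source buffers alive until the copies complete
    }

Because the copies stay on the device, the merge adds little overhead compared with staging the per-rank matrices through host memory.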

EXPERIMENTAL SETUP
Problem and system
- Using the HPC committee 3D lid-driven cavity model described in the PRACE hpc
- The medium (M) test problem fits adequately on a single GPU (200x200x200, 8 MM cells in total)

[Figure: lid-driven cavity (M, 200x200x200, 20 steps) solution, accelerated with AmgX]

RESULTS – PRESSURE SOLVE
M problem (8 MM cells, 100 iters) on DGX-1
- Measuring all steps required to fulfil the pressure solve, i.e. LDU2CSR, comms, memory copies, solve, etc.
- Significant GPU speedups over FOAM-GAMG of 4x to 8x
- The new A100 GPU is 1.6x faster than the V100, for A100 speedups over FOAM-GAMG of 6x to 13x
- Still room for improvement

RESULTS – WALLCLOCK
M problem (8 MM cells, 100 iters) on DGX-1
- Full-solution wallclock is reasonable considering that only the pressure solve (now 35% of the total) is GPU accelerated
- The overhead of acceleration (i.e. copying data to and from the GPU) is small
- Could be greatly improved by accelerating matrix assembly and the momentum solves

RESULTS – WEAK SCALING
Preliminary multi-GPU results on DGX-1
- The solver can be run on multiple GPUs across multiple nodes
- Consolidation for CPU cores also works in multi-GPU configurations
- Data movement is minimal, but could be removed if the prior steps were accelerated
- Setup scaling limits the non-cached case; the limitation is well understood and we are currently optimising

CONCLUSIONS
- Early results showcase the OpenFOAM pressure solve accelerated on NVIDIA V100 GPUs using AmgX, achieving 4x to 8x speedups
- A new library, FOAM2CSR, was developed for low-overhead conversion between OpenFOAM LDU matrices and GPU-resident CSR matrices
- Changes to AmgX and AmgXWrapper enable integration with OpenFOAM and improved performance
- The multi-GPU/multi-node implementation is fully functional and performance optimisation is ongoing