INTRODUCTION TO GPU COMPUTING

Transcription

INTRODUCTION TO GPU COMPUTING
Jeff Larkin, June 28, 2018

THE WORLD LEADER IN VISUAL COMPUTING
Gaming, Auto, Enterprise, HPC & Cloud, OEM & IP

HPC'S BIGGEST CHALLENGE: POWER
The power required for a CPU-only exaflop supercomputer is comparable to the power used by the Bay Area, CA (San Francisco + San Jose).


CUDA ECOSYSTEM 2018
3,500,000 CUDA downloads in 2017; 8,000+ attendees.

CUDA APPLICATION ECOSYSTEM
From Ease of Use to Specialized Performance: Applications; Frameworks; Libraries; Directives and Standard Languages; Specialized Languages (CUDA C, CUDA Fortran).

FUNDAMENTALS OF GPU COMPUTING

ACCELERATED COMPUTING
10x Performance & 5x Energy Efficiency for HPC
The CPU is optimized for serial tasks; the GPU accelerator is optimized for parallel tasks.

ACCELERATED COMPUTING: CPU STRENGTHS AND WEAKNESSES
CPU strengths:
- Very large main memory
- Very fast clock speeds
- Latency optimized via large caches
- A small number of threads can run very quickly
CPU weaknesses:
- Relatively low memory bandwidth
- Cache misses are very costly
- Low performance/watt

ACCELERATED COMPUTING: GPU STRENGTHS AND WEAKNESSES
GPU strengths:
- High-bandwidth main memory
- Latency tolerant via parallelism
- Significantly more compute resources
- High throughput
- High performance/watt
GPU weaknesses:
- Relatively low memory capacity
- Low per-thread performance


SPEED VS. THROUGHPUT
Which is better depends on your needs. (Images from Wikimedia Commons via Creative Commons.)

ACCELERATOR NODES
- CPU and GPU have distinct memories: CPU memory is generally larger and slower; GPU memory is generally smaller and faster.
- Execution begins on the CPU; data and computation are offloaded to the GPU.
- CPU and GPU communicate via PCIe: data must be copied between these memories over PCIe, and PCIe bandwidth is much lower than that of either memory.
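As a minimal sketch of this offload pattern (my example, not from the slides), the CUDA runtime calls below allocate separate host and device buffers and move data across PCIe explicitly; the array size and the trivial kernel are placeholders chosen for illustration.

    // Sketch (assumed example): distinct host/device memories and PCIe copies
    // using the CUDA runtime API. Build with nvcc.
    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void scale(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;           // trivial placeholder computation
    }

    int main(void)
    {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_data = (float *)malloc(bytes);      // CPU (host) memory
        float *d_data;
        cudaMalloc(&d_data, bytes);                  // GPU (device) memory

        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // over PCIe
        scale<<<(n + 255) / 256, 256>>>(d_data, n);                 // offloaded work
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // back over PCIe

        cudaFree(d_data);
        free(h_data);
        return 0;
    }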

HOW GPU ACCELERATION WORKS
Compute-intensive functions (a small percentage of the application code) run on the GPU; the rest of the sequential code runs on the CPU.

3 WAYS TO PROGRAM GPUs
- Libraries: "drop-in" acceleration
- Compiler directives: easily accelerate applications
- Programming languages: maximum flexibility

SIMPLICITY & PERFORMANCE
Ranging from simplicity to performance:
- Accelerated libraries: little or no code change for standard libraries and high performance, but limited by what libraries are available.
- Compiler directives: high level, based on existing languages, simple and familiar; being high level, they give less control over performance.
- Parallel language extensions: expose low-level details for maximum performance, but are often more difficult to learn and more time consuming to implement.

CODE FOR SIMPLICITY & PERFORMANCE
- Libraries: implement as much as possible using portable libraries.
- Directives: use directives to rapidly accelerate your code.
- Languages: use lower-level languages for important kernels.

GPU DEVELOPER ECOSYSTEM
- Numerical packages: MATLAB, Mathematica, NI LabVIEW, pyCUDA
- Debuggers & profilers: cuda-gdb, NVIDIA Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView
- GPU compilers: C, C++, Fortran, Java, Python; auto-parallelizing and cluster tools
- Libraries: NPP, video and imaging libraries, GPULib
- Consultants & training: ANEO, GPU Tech
- OEM solution providers

GPU LIBRARIES

LIBRARIES: EASY, HIGH-QUALITY ACCELERATION
- Ease of use: using libraries enables GPU acceleration without in-depth knowledge of GPU programming.
- "Drop-in": many GPU-accelerated libraries follow standard APIs, enabling acceleration with minimal code changes.
- Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications.
- Performance: NVIDIA libraries are tuned by experts.

GPU-ACCELERATED LIBRARIES
"Drop-in" acceleration for your applications:
- Linear algebra (FFT, BLAS, SPARSE, matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
- Numerical & math (RAND, statistics): NVIDIA Math Lib, NVIDIA cuRAND
- Data structures & AI (sort, scan, zero sum; GPU AI for board games and path finding)
- Visual processing (image & video): NVIDIA NPP, NVIDIA Video Encode
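As a small, hedged illustration of calling one of these libraries without writing any GPU kernel code, the sketch below uses the cuRAND host API to fill a device buffer with random numbers; the buffer size and seed are arbitrary choices of mine, not from the slides.

    // Sketch (assumed example): GPU random-number generation via the cuRAND
    // host API; the application never writes a kernel. Link with -lcurand.
    #include <cuda_runtime.h>
    #include <curand.h>

    int main(void)
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
        curandGenerateUniform(gen, d_data, n);   // fills d_data on the GPU

        curandDestroyGenerator(gen);
        cudaFree(d_data);
        return 0;
    }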

DROP-IN ACCELERATION
In Two Easy Steps
The starting point is plain CPU code (the slide shows the same code on both sides before any changes):

    int N = 1<<20;                       // 1M elements

    x = (float *)malloc(N * sizeof(float));
    y = (float *)malloc(N * sizeof(float));
    initData(x, y);

    // Perform SAXPY on 1M elements: y[] = a*x[] + y[]
    saxpy(N, 2.0, x, 1, y, 1);

    useResult(y);

DROP-IN ACCELERATION
With Automatic Data Management
Step 1: Update memory allocation to be CUDA-aware. Here we use Unified Memory, which automatically migrates data between the host (CPU) and device (GPU) as needed by the program. Only the allocations change:

    int N = 1<<20;                       // 1M elements

    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    initData(x, y);

    // Perform SAXPY on 1M elements: y[] = a*x[] + y[]
    saxpy(N, 2.0, x, 1, y, 1);

    useResult(y);
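As a small hedged illustration of why Step 1 alone already changes the programming model: with cudaMallocManaged the same pointer can be touched by CPU code and by a kernel, and the runtime migrates pages as needed. The kernel and values below are my additions, not part of the slide.

    // Sketch (assumed example): one Unified Memory pointer, used by CPU and GPU.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void doubleAll(float *v, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= 2.0f;
    }

    int main(void)
    {
        int n = 1 << 20;
        float *v;
        cudaMallocManaged(&v, n * sizeof(float));    // visible to CPU and GPU

        for (int i = 0; i < n; ++i) v[i] = 1.0f;     // initialized on the CPU

        doubleAll<<<(n + 255) / 256, 256>>>(v, n);   // updated on the GPU
        cudaDeviceSynchronize();                     // wait before touching v again

        printf("v[0] = %f\n", v[0]);                 // read back on the CPU: 2.0
        cudaFree(v);
        return 0;
    }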

DROP-IN ACCELERATION
With Automatic Data Management (continued)
Step 2: Call the CUDA library version of the API. Many standard libraries (BLAS, FFT, etc.) have well-defined interfaces, and the CUDA libraries try to match those interfaces as far as possible:

    int N = 1<<20;                       // 1M elements

    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    initData(x, y);

    // Perform SAXPY on 1M elements: y[] = a*x[] + y[]
    cublasSaxpy(N, 2.0, x, 1, y, 1);

    useResult(y);
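For reference, a complete, compilable version of the same idea using the current cuBLAS v2 API (which adds a handle and passes alpha by pointer) might look like the sketch below; the v2-style calls and the initialization values are my additions rather than part of the slide.

    // Sketch (assumed example): Unified Memory + cuBLAS v2 SAXPY.
    // Build with nvcc and link with -lcublas.
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>

    int main(void)
    {
        int N = 1 << 20;                              // 1M elements
        float *x, *y;
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 2.0f;
        cublasSaxpy(handle, N, &alpha, x, 1, y, 1);   // y = alpha*x + y on the GPU
        cudaDeviceSynchronize();                      // wait before using y on the CPU

        printf("y[0] = %f\n", y[0]);                  // expect 4.0

        cublasDestroy(handle);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }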

DROP-IN ACCELERATION
With Explicit Data Management
Step 3: Manage data locality. If not using Unified Memory, the program moves the data up to the GPU and back itself:

    int N = 1<<20;                       // 1M elements

    x = (float *)malloc(N * sizeof(float));
    y = (float *)malloc(N * sizeof(float));
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMalloc(&d_y, N * sizeof(float));
    initData(x, y);

    // Copy working data from CPU -> GPU
    cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
    cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);

    // Perform SAXPY on 1M elements: y[] = a*x[] + y[]
    cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);

    // Bring the result back to the CPU
    cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);

    useResult(y);

EXPLORE CUDA LIBRARIES

OPENACC DIRECTIVES

OpenACC is a directives-based programming approach to parallel computing designed for performance and portability on CPUs and GPUs for HPC.

Add a simple compiler directive:

    main()
    {
        <serial code>
        #pragma acc kernels
        {
            <parallel code>
        }
    }

OPENACC: SIMPLE, POWERFUL, PORTABLE
Fueling the next wave of scientific discoveries in HPC; 8,000+ developers using OpenACC.
- University of Illinois, PowerGrid (MRI reconstruction): 70x speed-up with 2 days of effort.
- RIKEN Japan, NICAM (climate modeling): 7-8x speed-up with 5% of the code modified.

    main()
    {
        <serial code>
        #pragma acc kernels    // automatically runs on GPU
        {
            <parallel code>
        }
    }

SINGLE CODE FOR MULTIPLE PLATFORMS
OpenACC: a performance-portable programming model for HPC
AWE Hydrodynamics CloverLeaf mini-app, bm32 data set; speedup vs. a single Haswell core. The same PGI OpenACC code targets x86 CPU, x86 Xeon Phi, NVIDIA GPU, POWER, PEZY-SC, and Sunway; CPU results are compared against Intel OpenMP and IBM OpenMP.
- Dual Haswell: 9x
- Dual Broadwell: 10x
- Dual POWER8: 11x
- 1 Tesla P100: 52x
- 1 Tesla V100: 77x
Systems: Haswell: 2x16-core Haswell server, four K80s, CentOS 7.2 (perf-hsw10); Broadwell: 2x20-core Broadwell server, eight P100s (dgx1-prd-01); Minsky: POWER8 with NVLINK, four P100s, RHEL 7.3 (gsn1). Compilers: Intel 17.0, IBM XL 13.1.3, PGI 16.10. Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7, 2016; CloverLeaf Serial; CloverLeaf ref (MPI + OpenMP); CloverLeaf OpenACC (MPI + OpenACC). Data compiled by PGI, November 2016; Volta data collected June 2017.

OPENACC COMPILER DIRECTIVES

Parallel C code:

    void saxpy(int n, float a,
               float *x, float *y)
    {
      #pragma acc kernels
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }
    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

Parallel Fortran code:

    subroutine saxpy(n, a, x, y)
      real :: x(:), y(:), a
      integer :: n, i
      !$acc kernels
      do i=1,n
        y(i) = a*x(i) + y(i)
      enddo
      !$acc end kernels
    end subroutine saxpy
    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x_d, y_d)
    ...

http://developer.nvidia.com/openacc or http://openacc.org
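A complete, compilable version of the C example might look like the sketch below; the main function, the initialization values, the restrict qualifiers (added so the compiler can prove the loop iterations are independent), and the compile line in the comment are my additions for illustration, not part of the slide.

    // Sketch (assumed example): complete OpenACC SAXPY in C.
    // With the PGI/NVIDIA HPC compilers, something like
    //   pgcc -acc -Minfo=accel saxpy.c
    // builds an accelerated executable (exact flags may differ by version).
    #include <stdio.h>
    #include <stdlib.h>

    void saxpy(int n, float a, float *restrict x, float *restrict y)
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        int n = 1 << 20;
        float *x = (float *)malloc(n * sizeof(float));
        float *y = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy(n, 2.0f, x, y);           // the compiler offloads the loop when built with -acc

        printf("y[0] = %f\n", y[0]);    // expect 4.0
        free(x);
        free(y);
        return 0;
    }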

PROGRAMMING LANGUAGES

CUDA C

Standard C code (serial):

    void saxpy(int n, float a,
               float *x, float *y)
    {
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }

    int N = 1<<20;

    // Perform SAXPY on 1M elements
    saxpy(N, 2.0, x, y);

Parallel C code (CUDA C):

    __global__
    void saxpy(int n, float a,
               float *x, float *y)
    {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] = a*x[i] + y[i];
    }

    int N = 1<<20;
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

    // Perform SAXPY on 1M elements
    saxpy<<<4096,256>>>(N, 2.0, d_x, d_y);

    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

http://developer.nvidia.com/cuda-toolkit
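The slide omits the allocations; a complete, compilable sketch that assembles the same kernel with the device allocations, frees, and initialization values added (those additions are mine) is:

    // Sketch (assumed example): complete CUDA C SAXPY, built with nvcc.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void saxpy(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        float *x = (float *)malloc(bytes);
        float *y = (float *)malloc(bytes);
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        float *d_x, *d_y;
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

        // One thread per element; 256 threads per block.
        saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);

        cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", y[0]);     // expect 4.0

        cudaFree(d_x); cudaFree(d_y);
        free(x); free(y);
        return 0;
    }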

THRUST C++ TEMPLATE LIBRARY

Serial C++ code (with STL and Boost):

    int N = 1<<20;
    std::vector<float> x(N), y(N);
    ...

    // Perform SAXPY on 1M elements
    std::transform(x.begin(), x.end(),
                   y.begin(), y.begin(),
                   2.0f * _1 + _2);

    www.boost.org/libs/lambda

Parallel C++ code:

    int N = 1<<20;
    thrust::host_vector<float> x(N), y(N);
    ...
    thrust::device_vector<float> d_x = x;
    thrust::device_vector<float> d_y = y;

    // Perform SAXPY on 1M elements
    thrust::transform(d_x.begin(), d_x.end(),
                      d_y.begin(), d_y.begin(),
                      2.0f * _1 + _2);

    http://thrust.github.com
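A self-contained version of the Thrust side, including the headers and the placeholder namespace it relies on plus the copy back to the host (those pieces are my additions), could look like:

    // Sketch (assumed example): complete Thrust SAXPY using placeholder
    // expressions; build with nvcc.
    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>
    #include <cstdio>

    int main(void)
    {
        using namespace thrust::placeholders;   // brings _1, _2 into scope

        int N = 1 << 20;
        thrust::host_vector<float> x(N, 1.0f), y(N, 2.0f);

        thrust::device_vector<float> d_x = x;   // copy to the GPU
        thrust::device_vector<float> d_y = y;

        // Perform SAXPY on 1M elements: d_y = 2.0f * d_x + d_y
        thrust::transform(d_x.begin(), d_x.end(),
                          d_y.begin(), d_y.begin(),
                          2.0f * _1 + _2);

        y = d_y;                                // copy the result back
        std::printf("y[0] = %f\n", (double)y[0]);   // expect 4.0
        return 0;
    }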

CUDA FORTRAN

Standard Fortran:

    module mymodule
    contains
      subroutine saxpy(n, a, x, y)
        real :: x(:), y(:), a
        integer :: n, i
        do i=1,n
          y(i) = a*x(i) + y(i)
        enddo
      end subroutine saxpy
    end module mymodule

    program main
      use mymodule
      real :: x(2**20), y(2**20)
      x = 1.0; y = 2.0

      ! Perform SAXPY on 1M elements
      call saxpy(2**20, 2.0, x, y)
    end program main

Parallel Fortran (CUDA Fortran):

    module mymodule
    contains
      attributes(global) subroutine saxpy(n, a, x, y)
        real :: x(:), y(:), a
        integer :: n, i
        attributes(value) :: a, n
        i = threadIdx%x + (blockIdx%x-1)*blockDim%x
        if (i<=n) y(i) = a*x(i) + y(i)
      end subroutine saxpy
    end module mymodule

    program main
      use cudafor; use mymodule
      real, device :: x_d(2**20), y_d(2**20)
      x_d = 1.0; y_d = 2.0

      ! Perform SAXPY on 1M elements
      call saxpy<<<4096,256>>>(2**20, 2.0, x_d, y_d)
    end program main

http://developer.nvidia.com/cuda-fortran

PYTHON: NUMBA

Standard Python:

    import numpy as np

    def saxpy(a, x, y):
        return [a * xi + yi
                for xi, yi in zip(x, y)]

    x = np.arange(2**20, dtype=np.float32)
    y = np.arange(2**20, dtype=np.float32)

    cpu_result = saxpy(2.0, x, y)

    http://numpy.scipy.org

Parallel Python (Numba):

    import numpy as np
    from numba import vectorize

    @vectorize(['float32(float32, float32, float32)'],
               target='cuda')
    def saxpy(a, x, y):
        return a * x + y

    N = 1048576

    # Initialize arrays
    A = np.ones(N, dtype=np.float32)
    B = np.ones(A.shape, dtype=A.dtype)
    C = np.empty_like(A, dtype=A.dtype)

    # Add arrays on GPU
    C = saxpy(2.0, A, B)

    https://numba.pydata.org

ENABLING ENDLESS WAYS TO SAXPY
- The CUDA compiler has been contributed to the open-source LLVM project.
- CUDA C, C++, and Fortran today; new language support makes it possible to build front-ends for Java, Python, R, and DSLs.
- New processor support: target other processors such as ARM, FPGA, GPUs, and x86.
- The LLVM compiler for CUDA targets NVIDIA GPUs and x86 CPUs.

CUDA TOOLKIT: DOWNLOAD TODAY!
Everything You Need to Accelerate Applications
Getting-started resources, CUDA documentation (Installation Guide, Best Practices Guide, Programming Guide, CUDA Tools Guide, API Reference, Samples), and industry applications.
developer.nvidia.com/cuda-toolkit
