Using OpenACC With CUDA Libraries - University Of Tennessee

Transcription

Using OpenACC With CUDA Libraries
John Urbanic, with NVIDIA
Pittsburgh Supercomputing Center
Copyright 2014

3 Ways to Accelerate Applications
Libraries: "Drop-in" Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
CUDA Libraries are interoperable with OpenACC

3 Ways to Accelerate Applications
Libraries: "Drop-in" Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
CUDA Languages are interoperable with OpenACC

CUDA Libraries Overview

GPU Accelerated Libraries: "Drop-in" Acceleration for Your Applications
NVIDIA cuBLAS, NVIDIA cuRAND, NVIDIA cuSPARSE, NVIDIA cuFFT, NVIDIA NPP
Vector Signal Image Processing, GPU Accelerated Linear Algebra, Matrix Algebra on GPU and Multicore, Sparse Linear Algebra
IMSL Library, Building-block Algorithms for CUDA, C++ STL Features for CUDA

CUDA Math Libraries
High performance math routines for your applications:
cuFFT – Fast Fourier Transforms Library
cuBLAS – Complete BLAS Library
cuSPARSE – Sparse Matrix Library
cuRAND – Random Number Generation (RNG) Library
NPP – Performance Primitives for Image & Video Processing
Thrust – Templated C++ Parallel Algorithms & Data Structures
math.h – C99 floating-point Library
Included in the CUDA Toolkit. Free download @ www.nvidia.com/getcuda
More are always available at the NVIDIA Developer site.

How To Use CUDA Libraries With OpenACC

Sharing data with libraries
CUDA libraries and OpenACC both operate on device arrays.
OpenACC provides mechanisms for interop with library calls:
  the deviceptr data clause
  the host_data construct
These same mechanisms are useful for interoperating with custom CUDA C, C++ and Fortran code.

deviceptr Data Clause
deviceptr( list ): Declares that the pointers in list are device pointers, so the data need not be allocated or moved between the host and device for these pointers.
Example:
C:       #pragma acc data deviceptr(d_input)
Fortran: !$acc data deviceptr(d_input)
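For illustration, a minimal sketch (names such as d_input and scale_on_device are hypothetical) of handing a cudaMalloc'd array to an OpenACC loop with deviceptr, so the compiler performs no allocation or copies for it:

    #include <cuda_runtime.h>

    /* The loop operates directly on a device pointer; deviceptr tells the
       OpenACC compiler not to allocate or copy the array. */
    void scale_on_device(float *d_input, int n, float factor)
    {
        #pragma acc parallel loop deviceptr(d_input)
        for (int i = 0; i < n; i++)
            d_input[i] *= factor;
    }

    void example(int n)
    {
        float *d_input;
        cudaMalloc((void **)&d_input, n * sizeof(float));  /* device allocation */
        scale_on_device(d_input, n, 2.0f);                 /* no implicit copies */
        cudaFree(d_input);
    }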

host_data Construct
Makes the address of device data available on the host.
use_device( list ): Tells the compiler to use the device address for any variable in list. Variables in the list must be present in device memory due to data regions that contain this construct.
Example:
C:       #pragma acc host_data use_device(d_input)
Fortran: !$acc host_data use_device(d_input)
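As a hedged sketch of the reverse direction, host_data can expose OpenACC-managed arrays to a CUDA library. Here a hypothetical saxpy_with_cublas wrapper (assuming the cublas_v2 API) passes the device addresses of x and y to cuBLAS:

    #include <cublas_v2.h>

    /* x and y are host arrays; the OpenACC data region creates device copies,
       and host_data use_device hands their device addresses to cuBLAS. */
    void saxpy_with_cublas(int n, float a, float *x, float *y)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc host_data use_device(x, y)
            {
                /* cuBLAS sees device pointers here, not the host addresses */
                cublasSaxpy(handle, n, &a, x, 1, y, 1);
            }
        }

        cublasDestroy(handle);
    }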

Example: 1D convolution using CUFFT
Perform convolution in frequency space:
1. Use CUFFT to transform the input signal and filter kernel into the frequency domain
2. Perform point-wise complex multiply and scale on the transformed signal
3. Use CUFFT to transform the result back into the time domain
We will perform step 2 using OpenACC. Code highlights follow (a condensed sketch of the whole sequence appears below). Code available with exercises in: Exercises/Cufft-acc
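A condensed sketch of the three steps (not the exact exercise code; the plan setup and the name convolve_sketch are illustrative) might look like:

    #include <cufft.h>

    /* OpenACC point-wise multiply routine shown later in this deck */
    void complexPointwiseMulAndScale(int n, float *signal, float *filter_kernel);

    void convolve_sketch(cufftComplex *d_signal, cufftComplex *d_filter_kernel, int new_size)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, new_size, CUFFT_C2C, 1);

        /* Step 1: transform signal and filter kernel into the frequency domain */
        cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);
        cufftExecC2C(plan, d_filter_kernel, d_filter_kernel, CUFFT_FORWARD);

        /* Step 2: point-wise complex multiply and scale, done with OpenACC */
        complexPointwiseMulAndScale(new_size, (float *)d_signal, (float *)d_filter_kernel);

        /* Step 3: transform the result back into the time domain */
        cufftExecC2C(plan, d_signal, d_signal, CUFFT_INVERSE);
        cufftDestroy(plan);
    }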

Source Excerpt: Allocating Data

    // Allocate host memory for the signal and filter
    Complex *h_signal = (Complex *)malloc(sizeof(Complex) * SIGNAL_SIZE);
    Complex *h_filter_kernel = (Complex *)malloc(sizeof(Complex) * FILTER_KERNEL_SIZE);
    ...
    // Allocate device memory for signal
    Complex *d_signal;
    checkCudaErrors(cudaMalloc((void **)&d_signal, mem_size));
    // Copy host memory to device
    checkCudaErrors(cudaMemcpy(d_signal, h_padded_signal, mem_size, cudaMemcpyHostToDevice));
    // Allocate device memory for filter kernel
    Complex *d_filter_kernel;
    checkCudaErrors(cudaMalloc((void **)&d_filter_kernel, mem_size));

Source Excerpt: Sharing Device Data (d_signal, d_filter_kernel)

    // Transform signal and kernel (CUDA routines)
    error = cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD);
    error = cufftExecC2C(plan, (cufftComplex *)d_filter_kernel, (cufftComplex *)d_filter_kernel, CUFFT_FORWARD);

    // Multiply the coefficients together and normalize the result (OpenACC routine)
    printf("Performing point-wise complex multiply and scale.\n");
    complexPointwiseMulAndScale(new_size, (float *restrict)d_signal, (float *restrict)d_filter_kernel);

    // Transform signal back (CUDA routine)
    error = cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE);

OpenACC Convolution Code

    void complexPointwiseMulAndScale(int n, float *restrict signal, float *restrict filter_kernel)
    {
        // Multiply the coefficients together and normalize the result
        #pragma acc data deviceptr(signal, filter_kernel)
        {
            #pragma acc kernels loop independent
            for (int i = 0; i < n; i++) {
                float ax = signal[2*i];
                float ay = signal[2*i+1];
                float bx = filter_kernel[2*i];
                float by = filter_kernel[2*i+1];
                float s  = 1.0f / n;
                float cx = s * (ax * bx - ay * by);
                float cy = s * (ax * by + ay * bx);
                signal[2*i]   = cx;
                signal[2*i+1] = cy;
            }
        }
    }

Note: The PGI C compiler does not currently support structs in OpenACC loops, so we cast the Complex* pointers to float* pointers and use interleaved indexing.

Linking CUFFT

    #include "cufft.h"

Compiler command line options (must use the PGI-provided CUDA toolkit paths; must link libcudart and libcufft):

    CUDA_PATH = /opt/pgi/13.10.0/linux86-64/2013/cuda/5.0
    CCFLAGS   = -I$(CUDA_PATH)/include -L$(CUDA_PATH)/lib64 -lcudart -lcufft

Result

    instr009@nid27635:~/Cufft> aprun -n 1 cufft_acc
    Transforming signal cufftExecC2C
    Performing point-wise complex multiply and scale.
    Transforming signal back cufftExecC2C
    Performing Convolution on the host and checking correctness
    Signal size: 500000, filter size: 33
    Total Device Convolution Time: 6.576960 ms (0.186368 for point-wise convolution)
    Test PASSED

(Slide callouts label which parts of the run come from CUFFT, cudaMemcpy, and the OpenACC routine.)

Summary
Use the deviceptr data clause to pass pre-allocated device data to OpenACC regions and loops.
Use host_data to get device addresses for pointers inside acc data regions.
The same techniques shown here can be used to share device data between OpenACC loops and:
  your custom CUDA C/C++/Fortran/etc. device code
  any CUDA library that uses CUDA device pointers

Appendix: Compelling Cases For Various Libraries Of Possible Interest To You

cuFFT: Multi-dimensional FFTs
New in CUDA 4.1:
Flexible input & output data layouts for all transform types
  Similar to the FFTW "Advanced Interface"
  Eliminates extra data transposes and copies
API is now thread-safe & callable from multiple host threads
Restructured documentation to clarify data layouts
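As an illustration of the flexible-layout (FFTW-style advanced) interface, here is a hedged sketch using cufftPlanMany to transform a batch of interleaved 1D signals in place; the sizes, strides, and function name are made up for the example:

    #include <cufft.h>

    /* Hypothetical example: 4 interleaved signals of length 1024 stored in one
       array, transformed in place with a single batched plan. */
    void plan_strided_ffts(cufftComplex *d_data)
    {
        int n[1]       = { 1024 };   /* length of each 1D transform           */
        int inembed[1] = { 1024 };   /* input storage dimensions (rank 1)     */
        int onembed[1] = { 1024 };   /* output storage dimensions (rank 1)    */
        int istride = 4, idist = 1;  /* element k of signal j sits at j + 4*k */
        int ostride = 4, odist = 1;
        int batch = 4;

        cufftHandle plan;
        cufftPlanMany(&plan, 1, n, inembed, istride, idist,
                      onembed, ostride, odist, CUFFT_C2C, batch);

        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  /* in-place forward FFT */
        cufftDestroy(plan);
    }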

FFTs up to 10x Faster than MKL
1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs.
[Charts: cuFFT single-precision and double-precision GFLOPS vs. Log2(size), measured on sizes that are exactly powers of 2.]
Performance may vary based on OS version and motherboard configuration. cuFFT 4.1 on Tesla M2090, ECC on; MKL 10.2.3, TYAN FT72-B7015, Xeon x5680 Six-Core @ 3.33 GHz.

CUDA 4.1 optimizes 3D transforms
Single precision, all sizes 2x2x2 to 128x128x128.
Consistently faster than MKL; 3x faster than CUFFT 4.0 on average.
[Chart: GFLOPS vs. size (NxNxN), N = 16 to 128, comparing CUFFT 4.1, CUFFT 4.0, and MKL.]
Performance may vary based on OS version and motherboard configuration. cuFFT 4.1 on Tesla M2090, ECC on; MKL 10.2.3, TYAN FT72-B7015, Xeon x5680 Six-Core @ 3.33 GHz.

cuBLAS: Dense Linear Algebra on GPUs
Complete BLAS implementation plus useful extensions
Supports all 152 standard routines for single, double, complex, and double complex
New in CUDA 4.1:
New batched GEMM API provides 4x speedup over MKL
  Useful for batches of 100 small matrices from 4x4 to 128x128
5%-10% performance improvement to large GEMMs
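A hedged sketch of the batched GEMM API mentioned above (the name batched_gemm_sketch and the setup code are illustrative; note that the batched call expects the arrays of per-matrix device pointers to themselves reside in device memory):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void batched_gemm_sketch(int n, int batch, float **dA, float **dB, float **dC)
    {
        /* dA, dB, dC are host arrays holding `batch` device pointers,
           each pointing to an n x n matrix already resident on the GPU. */
        cublasHandle_t handle;
        cublasCreate(&handle);

        /* Copy the pointer arrays themselves to device memory. */
        float **dA_dev, **dB_dev, **dC_dev;
        cudaMalloc((void **)&dA_dev, batch * sizeof(float *));
        cudaMalloc((void **)&dB_dev, batch * sizeof(float *));
        cudaMalloc((void **)&dC_dev, batch * sizeof(float *));
        cudaMemcpy(dA_dev, dA, batch * sizeof(float *), cudaMemcpyHostToDevice);
        cudaMemcpy(dB_dev, dB, batch * sizeof(float *), cudaMemcpyHostToDevice);
        cudaMemcpy(dC_dev, dC, batch * sizeof(float *), cudaMemcpyHostToDevice);

        const float alpha = 1.0f, beta = 0.0f;
        /* C[i] = alpha * A[i] * B[i] + beta * C[i] for every matrix in the batch */
        cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                           n, n, n, &alpha,
                           (const float **)dA_dev, n,
                           (const float **)dB_dev, n,
                           &beta, dC_dev, n, batch);

        cudaFree(dA_dev);
        cudaFree(dB_dev);
        cudaFree(dC_dev);
        cublasDestroy(handle);
    }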

cuBLAS Level 3 Performance
Up to 1 TFLOPS sustained performance and 6x speedup over Intel MKL.
[Charts: GFLOPS and speedup over MKL across Level 3 routines (e.g., SYRK, TRMM, TRSM) in single, double, complex, and double complex precision; 4Kx4K matrix size.]
Performance may vary based on OS version and motherboard configuration. cuBLAS 4.1, Tesla M2090 (Fermi), ECC on.

ZGEMM Performance vs Intel MKL
[Chart: GFLOPS vs. matrix size (NxN), N = 256 to 2048.]
Performance may vary based on OS version and motherboard configuration. cuBLAS 4.1 on Tesla M2090, ECC on; MKL 10.2.3, TYAN FT72-B7015, Xeon x5680 Six-Core @ 3.33 GHz.

cuBLAS Batched GEMM API improves performance on batches of small matrices
[Chart: GFLOPS vs. matrix dimension (NxN), N = 16 to 128, comparing cuBLAS with 100 matrices, cuBLAS with 10,000 matrices, and MKL with 10,000 matrices.]
Performance may vary based on OS version and motherboard configuration. cuBLAS 4.1 on Tesla M2090, ECC on; MKL 10.2.3, TYAN FT72-B7015, Xeon x5680 Six-Core @ 3.33 GHz.

cuSPARSE: Sparse linear algebra routines
Sparse matrix-vector multiplication & triangular solve
  APIs optimized for iterative methods
New in 4.1:
Tri-diagonal solver with speedups up to 10x over Intel MKL
ELL-HYB format offers 2x faster matrix-vector multiplication
[Illustration: the sparse matrix-vector product y = α·A·x + β·y, with a small sparse matrix A (nonzeros 1.0 through 7.0) applied to a dense vector.]
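For context, the CSR storage that the csrmv-style routines consume can be pictured with a small host-side example; the matrix values below are made up, and the loop is just a CPU reference for y = alpha*A*x + beta*y:

    #include <stdio.h>

    int main(void)
    {
        /* 4x4 matrix with 7 nonzeros, stored in CSR format */
        float csrVal[7]    = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f };
        int   csrColInd[7] = { 0, 1, 1, 2, 0, 3, 3 };
        int   csrRowPtr[5] = { 0, 2, 4, 6, 7 };  /* row i spans [csrRowPtr[i], csrRowPtr[i+1]) */

        float x[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
        float y[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
        float alpha = 1.0f, beta = 0.0f;

        /* Host reference for the operation cuSPARSE performs on the device */
        for (int row = 0; row < 4; row++) {
            float sum = 0.0f;
            for (int j = csrRowPtr[row]; j < csrRowPtr[row + 1]; j++)
                sum += csrVal[j] * x[csrColInd[j]];
            y[row] = alpha * sum + beta * y[row];
        }

        for (int row = 0; row < 4; row++)
            printf("y[%d] = %f\n", row, y[row]);
        return 0;
    }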

cuSPARSE is 6x Faster than Intel MKL
Sparse matrix x dense vector performance.
[Chart: speedup over Intel MKL for csrmv* and hybmv* across a set of test matrices.]
*Average speedup over single, double, single complex & double complex.
Performance may vary based on OS version and motherboard configuration. cuSPARSE 4.1, Tesla M2090 (Fermi), ECC on; MKL 10.2.3, TYAN FT72-B7015, Xeon x5680 Six-Core.

Up to 40x faster with 6 CSR Vectors
cuSPARSE sparse matrix x 6 dense vectors (csrmm); useful for block iterative solve schemes.
[Chart: speedup over MKL in single, double, single complex, and double complex precision across a set of test matrices.]
Performance may vary based on OS version and motherboard configuration. cuSPARSE 4.1, Tesla M2090 (Fermi), ECC on; MKL 10.2.3, TYAN FT72-B7015, Xeon x5680 Six-Core.

Tri-diagonal solver performance vs. MKL
Speedup for tri-diagonal solver (gtsv)*.
[Chart: speedup over Intel MKL in single, double, complex, and double complex precision, matrix sizes 16384 to 4194304.]
*Parallel GPU implementation does not include pivoting.
Performance may vary based on OS version and motherboard configuration. cuSPARSE 4.1, Tesla M2090 (Fermi), ECC on; MKL 10.2.3, TYAN FT72-B7015, Xeon x5680 Six-Core.

cuRAND: Random Number Generation
Pseudo- and Quasi-RNGs
Supports several output distributions
Statistical test results reported in documentation
New commonly used RNGs in CUDA 4.1:
  MRG32k3a RNG
  MTGP11213 Mersenne Twister RNG
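A minimal sketch of generating device-side random numbers with the MRG32k3a generator named above (variable names and the seed are illustrative):

    #include <curand.h>

    /* Fill d_out (a device pointer) with n uniform double-precision samples. */
    void generate_uniform(double *d_out, size_t n)
    {
        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

        curandGenerateUniformDouble(gen, d_out, n);  /* samples land in device memory */

        curandDestroyGenerator(gen);
    }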

cuRAND Performance compared to Intel MKL
[Charts: giga-samples per second for double-precision uniform and normal distributions, comparing CURAND XORWOW, MRG32k3a, MTGP32, 32-bit Sobol, 32-bit scrambled Sobol, 64-bit Sobol, and 64-bit scrambled Sobol against MKL MRG32k3a and MKL 32-bit Sobol.]
Performance may vary based on OS version and motherboard configuration. cuRAND 4.1, Tesla M2090 (Fermi), ECC on; MKL 10.2.3, TYAN FT72-B7015, Xeon x5680 @ 3.33 GHz.
