INTRODUCTION TO CUDA C

Transcription

INTRODUCTION TO CUDA C Jeff Larkin, June 28, 2018

CUDA C/C++

[Slide figure: approaches to GPU acceleration, ranging from libraries ("Drop-in" Acceleration), to directives (Easily Accelerate Applications), to Programming Languages (Maximum Flexibility).]

HETEROGENEOUS COMPUTING

Terminology:
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)
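The host can ask the CUDA runtime about the attached device(s). As a small illustrative sketch (ours, not from the slides):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);             // How many CUDA devices are visible?
        for (int d = 0; d < count; d++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);  // Fill in this device's properties
            printf("Device %d: %s, %zu bytes of device memory\n",
                   d, prop.name, prop.totalGlobalMem);
        }
        return 0;
    }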

SIMPLE EXECUTION MODEL

[Slide figure: execution alternates between host and device: a serial region runs on the host, a parallel region runs on the device, then control returns to a serial region on the host.]

NVCC COMPILER

NVIDIA provides a CUDA C compiler, nvcc. NVCC splits your code in two: host code and device code.
- Host code is forwarded to the CPU compiler (usually g++).
- Device code is sent to the NVIDIA device compiler.
NVCC is capable of linking together both host and device code into a single executable.
Convention: C source files containing CUDA syntax are typically given the extension .cu.

EXAMPLE 1: HELLO WORLD

    int main() {
        printf("Hello, World!\n");
        return 0;
    }

Terminology: "Kernel": a function called on the GPU by all threads participating in a calculation.

OUR FIRST KERNEL

    __global__ void mykernel(void) {
    }

    int main(void) {
        mykernel<<<1,1>>>();
        printf("Hello, World!\n");
        return 0;
    }

The __global__ annotation informs the compiler that this is a kernel, which will be invoked on the device from the host. The angle bracket, or "chevron", syntax informs the compiler how many copies of the kernel mykernel to invoke. Here we will invoke it once.

OUR FIRST KERNEL

    __global__ void mykernel(void) {
        printf("Hello, World!\n");   // Move the work into the kernel.
    }

    int main(void) {
        mykernel<<<1,1>>>();
        cudaDeviceSynchronize();     // Tell the host to wait until the device is finished.
        return 0;
    }
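One practical note not covered on the slide: runtime calls and kernel launches report failures through cudaError_t, and it is worth checking these during development. A minimal sketch of how that might look:

    #include <stdio.h>

    __global__ void mykernel(void) {
        printf("Hello, World!\n");
    }

    int main(void) {
        mykernel<<<1,1>>>();
        // A kernel launch returns no value; query the launch status explicitly.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("Launch failed: %s\n", cudaGetErrorString(err));
        // cudaDeviceSynchronize() also returns any error raised while the kernel ran.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            printf("Kernel failed: %s\n", cudaGetErrorString(err));
        return 0;
    }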

COMPILING AND RUNNING

Compile the code with NVCC:
    nvcc main.cu
Run the resulting executable:
    ./a.out
    Hello, World!

PARALLEL PROGRAMMING IN CUDA C/C++

But wait, GPU computing is about massive parallelism! We need a more interesting example. We'll start by adding two integers (c = a + b) and build up to vector addition; a sketch of the two-integer version follows.
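The transcription does not include the two-integer example itself, so here is a minimal sketch of what it might look like, using cudaMallocManaged (introduced later in this deck) so the host and device can share the same pointers:

    #include <stdio.h>

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;            // A single thread adds the two integers.
    }

    int main(void) {
        int *a, *b, *c;
        cudaMallocManaged(&a, sizeof(int));
        cudaMallocManaged(&b, sizeof(int));
        cudaMallocManaged(&c, sizeof(int));
        *a = 2; *b = 7;
        add<<<1,1>>>(a, b, c);   // Launch one copy of the kernel on the device.
        cudaDeviceSynchronize(); // Wait for the device before reading c.
        printf("%d\n", *c);      // Prints 9
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }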

EXAMPLE 2: VECTOR ADDITION

    void vecadd(int *a, int *b, int *c, int N) {
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

Plan of Attack:
1. Move the addition to an element-wise function
2. Make the new function a kernel
3. Make the vectors available on the device
4. Invoke the new GPU kernel

VECADD: STEP 1, ELEMENT-WISE FUNCTION

    // Compute 1 element of c from a and b
    void vecadd_kernel(int *a, int *b, int *c, int N, int i) {
        if (i < N)   // Protect against out-of-bounds error
            c[i] = a[i] + b[i];
    }

    void vecadd(int *a, int *b, int *c, int N) {
        for (int i = 0; i < N; i++)
            vecadd_kernel(a, b, c, N, i);
    }

This new function calculates only the ith element of c. For now, we'll just replace the loop body.

THREAD HIERARCHY IN CUDA

[Slide figure: the CUDA thread hierarchy. Threads are grouped into thread blocks, and thread blocks are grouped into a grid.]

VECADD: STEP 2, MAKE A KERNEL

    // Compute 1 element of c from a and b
    __global__ void vecadd_kernel(int *a, int *b, int *c, int N) {
        int i = threadIdx.x;   // Calculate my index
        if (i < N)   // Protect against out-of-bounds error
            c[i] = a[i] + b[i];
    }

    void vecadd(int *a, int *b, int *c, int N) {
        for (int i = 0; i < N; i++)
            vecadd_kernel(a, b, c, N, i);   // We'll fix this call in step 4.
    }

Add the __global__ attribute to make it a kernel. Each thread knows its index in the thread hierarchy.

VECADD: STEP 3, MANAGE DATA

    int main() {
        int N = 512;
        int *a, *b, *c;

        a = (int*)malloc(N*sizeof(int));   // malloc'ed memory is only
        b = (int*)malloc(N*sizeof(int));   // available on the host.
        c = (int*)malloc(N*sizeof(int));
        ...
        vecadd(a, b, c, N);
        ...
        free(a);
        free(b);
        free(c);
        return 0;
    }

CUDA MEMORY MANAGEMENT

[Slide figure: without Unified Memory, system memory and GPU memory are separate pools; with Unified Memory, both appear as a single unified memory space.]

VECADD: STEP 3, MANAGE DATA

    int main() {
        int N = 512;
        int *a, *b, *c;

        cudaMallocManaged(&a, N*sizeof(int));   // Replace malloc() with
        cudaMallocManaged(&b, N*sizeof(int));   // cudaMallocManaged().
        cudaMallocManaged(&c, N*sizeof(int));
        ...
        vecadd(a, b, c, N);
        ...
        cudaFree(a);   // Replace free() with cudaFree().
        cudaFree(b);
        cudaFree(c);
        return 0;
    }

VECADD: STEP 4, INVOKE KERNEL

    // Compute 1 element of c from a and b
    __global__ void vecadd_kernel(int *a, int *b, int *c, int N) {
        int i = threadIdx.x;   // Calculate my index
        if (i < N)   // Protect against out-of-bounds error
            c[i] = a[i] + b[i];
    }

    void vecadd(int *a, int *b, int *c, int N) {
        vecadd_kernel<<<1,N>>>(a, b, c, N);   // Launch vecadd_kernel() on 1 thread block with N threads.
        cudaDeviceSynchronize();              // Ensure the kernel completes before vecadd() returns.
    }

VECADD: STEP 4, INVOKE KERNEL

    // Compute 1 element of c from a and b
    __global__ void vecadd_kernel(int *a, int *b, int *c, int N) {
        int i = blockIdx.x;   // Calculate my index (the block index, since each block has one thread)
        if (i < N)   // Protect against out-of-bounds error
            c[i] = a[i] + b[i];
    }

    void vecadd(int *a, int *b, int *c, int N) {
        vecadd_kernel<<<N,1>>>(a, b, c, N);   // Launch vecadd_kernel() on N thread blocks with 1 thread each.
        cudaDeviceSynchronize();              // Ensure the kernel completes before vecadd() returns.
    }

COMBINING BLOCKS AND THREADS

We've seen parallel vector addition using:
- Several blocks with one thread each
- One block with several threads
To utilize all the cores we need to use both blocks and threads, so let's adapt vector addition to use both. First, let's discuss data indexing.

BUILT-IN VARIABLES

- threadIdx.x: thread index within the block
- blockIdx.x: block index within the grid
- blockDim.x: number of threads in a block
- gridDim.x: number of blocks in a grid

These exist automatically in CUDA kernels and are read-only (set by the runtime).
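As a small illustrative example (ours, not from the slides), a kernel can print its own position in the hierarchy using these variables:

    #include <stdio.h>

    __global__ void whoami(void) {
        // Each thread locates itself using the built-in variables.
        printf("block %d of %d, thread %d of %d\n",
               blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
    }

    int main(void) {
        whoami<<<2,4>>>();         // A grid of 2 blocks, each with 4 threads
        cudaDeviceSynchronize();   // 8 lines of output in total
        return 0;
    }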

INDEXING ARRAYS WITH BLOCKS AND THREADS

No longer as simple as using blockIdx.x or threadIdx.x alone. Consider indexing an array with one element per thread, using 8 threads per block:

[Slide figure: four blocks of 8 threads. Within each block, threadIdx.x runs 0 through 7; the blocks are blockIdx.x = 0, 1, 2, 3.]

With blockDim.x threads per block, a unique index for each thread is given by:

    int index = blockIdx.x * blockDim.x + threadIdx.x;

INDEXING ARRAYS: EXAMPLE

Which thread will operate on the red element?

[Slide figure: a 32-element array split into blocks of M = 8 threads each. The red element is index 21, which falls to threadIdx.x = 5 within blockIdx.x = 2.]

    int index = blockIdx.x * blockDim.x + threadIdx.x
              = 2 * 8 + 5
              = 21;

VECADD: STEP 4, INVOKE KERNEL

    // Compute 1 element of c from a and b
    __global__ void vecadd_kernel(int *a, int *b, int *c, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // Calculate my index
        if (i < N)   // Protect against out-of-bounds error
            c[i] = a[i] + b[i];
    }

    void vecadd(int *a, int *b, int *c, int N) {
        vecadd_kernel<<<N/1024,1024>>>(a, b, c, N);   // Launch vecadd_kernel() on N/1024 thread blocks of 1024 threads.
        cudaDeviceSynchronize();                      // Ensure the kernel completes before vecadd() returns.
    }

BEST PRACTICE: ARBITRARY SIZE VECTORS

    // Compute 1 element of c from a and b
    __global__ void vecadd_kernel(int *a, int *b, int *c, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)   // Protect against out-of-bounds error
            c[i] = a[i] + b[i];
    }

    void vecadd(int *a, int *b, int *c, int N) {
        vecadd_kernel<<<(N+1023)/1024,1024>>>(a, b, c, N);
        cudaDeviceSynchronize();   // Ensure the kernel completes before vecadd() returns.
    }

If N is not evenly divisible by 1024, rounding the block count up with (N+1023)/1024 ensures enough blocks are created to cover all data elements.
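Putting steps 1 through 4 together, a complete managed-memory program might look like the following sketch (the initialization and verification code is our addition, not from the slides):

    #include <stdio.h>

    __global__ void vecadd_kernel(int *a, int *b, int *c, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            c[i] = a[i] + b[i];
    }

    void vecadd(int *a, int *b, int *c, int N) {
        vecadd_kernel<<<(N+1023)/1024,1024>>>(a, b, c, N);
        cudaDeviceSynchronize();
    }

    int main(void) {
        int N = 100000;   // Deliberately not a multiple of 1024
        int *a, *b, *c;
        cudaMallocManaged(&a, N*sizeof(int));
        cudaMallocManaged(&b, N*sizeof(int));
        cudaMallocManaged(&c, N*sizeof(int));
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2*i; }
        vecadd(a, b, c, N);
        for (int i = 0; i < N; i++)
            if (c[i] != 3*i) { printf("Mismatch at %d\n", i); return 1; }
        printf("Success\n");
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }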

CUDA MEMORY MANAGEMENT

Without Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data, *d_data;
        data = (char*)malloc(N);
        cudaMalloc(&d_data, N);

        fread(data, 1, N, fp);

        cudaMemcpy(d_data, data, N, cudaMemcpyHostToDevice);
        qsort<<<...>>>(d_data, N, 1, compare);
        cudaMemcpy(data, d_data, N, cudaMemcpyDeviceToHost);

        use_data(data);

        free(data);
        cudaFree(d_data);
    }

With Unified Memory:

    void sortfile(FILE *fp, int N) {
        char *data;
        cudaMallocManaged(&data, N);

        fread(data, 1, N, fp);

        qsort<<<...>>>(data, N, 1, compare);
        cudaDeviceSynchronize();

        use_data(data);

        cudaFree(data);
    }

(Example adapted from the NVIDIA developer blog post "Unified Memory in CUDA 6", unified-memory-in-cuda-6.)

CUDA MEMORY MANAGEMENT

cudaMalloc & cudaMemcpy:
- Explicitly track host and device memory
- Explicitly relocate data (sync or async)
- Expresses data locality (most performance)

cudaMallocManaged:
- Single pointer for host & device memory
- Transfer at launch and sync
- Data paged to the host on demand
- Device paging from the host in future hardware

Advice: develop with cudaMallocManaged, then optimize to cudaMalloc/cudaMemcpy if necessary.
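One intermediate optimization step between the two, as a hedged sketch (ours, not from the slides): managed memory can be prefetched to the device before a launch with cudaMemPrefetchAsync (available since CUDA 8 on Pascal and newer GPUs), avoiding on-demand page faults without giving up the single pointer:

    #include <cuda_runtime.h>

    int main(void) {
        int N = 1 << 20;
        int *a;
        cudaMallocManaged(&a, N*sizeof(int));

        int device = 0;
        cudaGetDevice(&device);                             // Which GPU will run our kernels?
        cudaMemPrefetchAsync(a, N*sizeof(int), device, 0);  // Migrate the pages to the GPU up front
        // ... launch kernels that read and write a ...
        cudaMemPrefetchAsync(a, N*sizeof(int), cudaCpuDeviceId, 0);  // Bring pages back for host access
        cudaDeviceSynchronize();
        cudaFree(a);
        return 0;
    }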

VECADD: EXPLICITLY MANAGE DATA

    int main() {
        int N = 512;
        int *a, *a_d, *b, *b_d, *c, *c_d;

        cudaMalloc(&a_d, N*sizeof(int));   // Use cudaMalloc to
        cudaMalloc(&b_d, N*sizeof(int));   // allocate device arrays.
        cudaMalloc(&c_d, N*sizeof(int));
        ...
        cudaMemcpy(a_d, a, N*sizeof(int), cudaMemcpyHostToDevice);   // Explicitly copy data
        cudaMemcpy(b_d, b, N*sizeof(int), cudaMemcpyHostToDevice);   // to and from the device.
        vecadd(a_d, b_d, c_d, N);
        cudaMemcpy(c, c_d, N*sizeof(int), cudaMemcpyDeviceToHost);
        ...
    }

The slide also notes that using a special allocator for the host arrays (pinned memory, e.g. cudaMallocHost()) will speed up data transfers.
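For completeness, a sketch of the full explicitly managed program; the pinned host allocation with cudaMallocHost() and the initialization loop are our additions, filling in the parts the slide elides:

    #include <stdio.h>

    __global__ void vecadd_kernel(int *a, int *b, int *c, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            c[i] = a[i] + b[i];
    }

    int main(void) {
        int N = 512;
        int *a, *b, *c;          // Host (pinned) arrays
        int *a_d, *b_d, *c_d;    // Device arrays

        cudaMallocHost(&a, N*sizeof(int));   // Pinned host memory speeds up transfers
        cudaMallocHost(&b, N*sizeof(int));
        cudaMallocHost(&c, N*sizeof(int));
        cudaMalloc(&a_d, N*sizeof(int));
        cudaMalloc(&b_d, N*sizeof(int));
        cudaMalloc(&c_d, N*sizeof(int));

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = i; }

        cudaMemcpy(a_d, a, N*sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(b_d, b, N*sizeof(int), cudaMemcpyHostToDevice);
        vecadd_kernel<<<(N+1023)/1024,1024>>>(a_d, b_d, c_d, N);
        cudaMemcpy(c, c_d, N*sizeof(int), cudaMemcpyDeviceToHost);   // Blocks until the kernel is done

        printf("c[10] = %d\n", c[10]);   // Prints 20

        cudaFreeHost(a); cudaFreeHost(b); cudaFreeHost(c);
        cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
        return 0;
    }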

CLOSING SUMMARY

CUDA C/C++ and Fortran provide close-to-the-metal performance, but may require rethinking your code.
CUDA programming explicitly replaces loops with parallel kernel execution.
Using CUDA Managed Memory simplifies data management by allowing the CPU and GPU to dereference the same pointer.