Parallel Programming With CUDA Ian Buck - Stanford University

Transcription

M02: High Performance Computing with CUDA
Parallel Programming with CUDA
Ian Buck

Outline
- CUDA model
- CUDA programming basics
- Tools
- GPU architecture for computing
- Q&A

What is CUDA?
- C with minimal extensions
- CUDA goals:
  - Scale code to 100s of cores
  - Scale code to 1000s of parallel threads
  - Allow heterogeneous computing: for example, CPU + GPU
- CUDA defines:
  - Programming model
  - Memory model

CUDA Programming Model
- Parallel code (kernel) is launched and executed on a device by many threads
- Threads are grouped into thread blocks
- Parallel code is written for a thread
  - Each thread is free to execute a unique code path
  - Built-in thread and block ID variables

Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
- Grid = all blocks for a given launch
- A thread block is a group of threads that can:
  - Synchronize their execution
  - Communicate via shared memory

IDs and Dimensions
- Threads: 3D IDs, unique within a block
- Blocks: 2D IDs, unique within a grid
- Dimensions set at launch time
  - Can be unique for each section
- Built-in variables: threadIdx, blockIdx, blockDim, gridDim

[Figure: a device running Grid 1, a 3x2 arrangement of blocks; Block (1, 1) is expanded to show its 5x3 arrangement of threads.]

Example: Increment Array Elements
- Increment N-element vector a by scalar b
- Let's assume N = 16, blockDim = 4 -> 4 blocks

    int idx = blockDim.x * blockIdx.x + threadIdx.x;

  blockIdx.x = 0          blockIdx.x = 1          blockIdx.x = 2          blockIdx.x = 3
  blockDim.x = 4          blockDim.x = 4          blockDim.x = 4          blockDim.x = 4
  threadIdx.x = 0,1,2,3   threadIdx.x = 0,1,2,3   threadIdx.x = 0,1,2,3   threadIdx.x = 0,1,2,3
  idx = 0,1,2,3           idx = 4,5,6,7           idx = 8,9,10,11         idx = 12,13,14,15

Example: Increment Array Elements

CPU program:

    void increment_cpu(float *a, float b, int N)
    {
        for (int idx = 0; idx < N; idx++)
            a[idx] = a[idx] + b;
    }

    void main()
    {
        ...
        increment_cpu(a, b, N);
    }

CUDA program:

    __global__ void increment_gpu(float *a, float b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + b;
    }

    void main()
    {
        ...
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        increment_gpu<<<dimGrid, dimBlock>>>(a, b, N);
    }

Minimal Kernel for 2D data

    __global__ void assign2D(int* d_a, int w, int h, int value)
    {
        int iy = blockDim.y * blockIdx.y + threadIdx.y;
        int ix = blockDim.x * blockIdx.x + threadIdx.x;
        int idx = iy * w + ix;   // flatten 2D coordinates into a 1D index
        d_a[idx] = value;
    }
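A hypothetical host-side launch for this kernel (not from the slides), assuming w and h are multiples of the 16x16 block dimensions:

    // One thread per element of the w x h array.
    dim3 block(16, 16);
    dim3 grid(w / 16, h / 16);
    assign2D<<< grid, block >>>(d_a, w, h, value);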

Blocks must be independent
- Any possible interleaving of blocks should be valid
  - presumed to run to completion without pre-emption
  - can run in any order
  - can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  - shared queue pointer: OK (see the sketch below)
  - shared lock: BAD, can easily deadlock
- Independence requirement gives scalability
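A minimal sketch of the queue-pointer pattern (hypothetical kernel; assumes d_queue_head is a device counter initialized to 0):

    // Blocks coordinate through an atomic queue pointer but never wait on
    // each other, so any execution order (concurrent or sequential) is valid.
    __global__ void drain_queue(int *d_queue_head, int *d_work, int n_items)
    {
        if (threadIdx.x == 0) {
            int item = atomicAdd(d_queue_head, 1);  // claim the next item
            if (item < n_items)
                d_work[item] = blockIdx.x;          // stand-in for real work
        }
    }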

Blocks must be independent
- Thread blocks can run in any order
  - Concurrently or sequentially
- Facilitates scaling of the same code across many devices
- Scalability

Memory Model
- Local storage
  - Each thread has its own local storage
  - Data lifetime = thread lifetime
- Shared memory
  - Each thread block has its own shared memory
  - Accessible only by threads within that block
  - Data lifetime = block lifetime
- Global (device) memory
  - Accessible by all threads as well as host (CPU)
  - Data lifetime = from allocation to deallocation
- Host (CPU) memory
  - Not directly accessible by CUDA threads

Memory model

[Figure: each thread has its per-thread local memory; each block has its per-block shared memory.]

Memory model

[Figure: sequential kernels (Kernel 0, Kernel 1, ...) all access the same per-device global memory.]

Memory model

[Figure: host memory is connected to Device 0 memory and Device 1 memory via cudaMemcpy().]

CUDA Programming Basics

Outline of CUDA Basics
- Basics to set up and execute CUDA code:
  - Extensions to C for kernel code
  - GPU memory management
  - GPU kernel launches
- Some additional basic features:
  - Checking CUDA errors
  - CUDA event API
  - Compilation path
- See the Programming Guide for the full API

Code executed on GPU
- C function with some restrictions:
  - Can only access GPU memory
  - No variable number of arguments
  - No static variables
- Must be declared with a qualifier (see the sketch below):
  - __global__: launched by CPU, cannot be called from GPU, must return void
  - __device__: called from other GPU functions, cannot be launched by the CPU
  - __host__: can be executed by CPU
  - __host__ and __device__ qualifiers can be combined (sample use: overloading operators)
- Built-in variables: gridDim, blockDim, blockIdx, threadIdx
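A hedged sketch of how the qualifiers combine (hypothetical functions, not from the slides):

    __device__ float square(float x)          // callable only from GPU code
    {
        return x * x;
    }

    __host__ __device__ float twice(float x)  // compiled for both CPU and GPU
    {
        return 2.0f * x;
    }

    __global__ void apply(float *a, int N)    // launched by the CPU; returns void
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = twice(square(a[idx]));
    }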

Variable Qualifiers (GPU code)
- __device__
  - stored in global memory (not cached, high latency)
  - accessible by all threads
  - lifetime: application
- __constant__
  - stored in global memory (cached)
  - read-only for threads, written by host
  - lifetime: application
- __shared__
  - stored in shared memory (latency comparable to registers)
  - accessible by all threads in the same thread block
  - lifetime: block
- Unqualified variables are stored in local storage:
  - scalars and built-in vector types are stored in registers
  - arrays are stored in device memory
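A hedged sketch of the qualifiers in use (hypothetical names, not from the slides):

    __device__   float g_scale;     // global memory; application lifetime
    __constant__ float c_bias;      // cached constant memory; the host would
                                    // set both, e.g. via cudaMemcpyToSymbol()

    __global__ void scale(float *a)
    {
        __shared__ float s_buf[128];  // shared memory; block lifetime
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unqualified scalar: register
        s_buf[threadIdx.x] = a[idx] * g_scale + c_bias;
        __syncthreads();
        a[idx] = s_buf[threadIdx.x];
    }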

Kernel Source Code

    __global__ void sum_kernel(int *g_input, int *g_output)
    {
        extern __shared__ int s_data[];   // allocated during kernel launch

        // read input into shared memory
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
        s_data[threadIdx.x] = g_input[idx];
        __syncthreads();

        // compute sum for the thread block
        for (int dist = blockDim.x / 2; dist > 0; dist /= 2)
        {
            if (threadIdx.x < dist)
                s_data[threadIdx.x] += s_data[threadIdx.x + dist];
            __syncthreads();
        }

        // write the block's sum to global memory
        if (threadIdx.x == 0)
            g_output[blockIdx.x] = s_data[0];
    }
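The unsized extern __shared__ array gets its byte count from the third launch parameter (described below); a hypothetical launch might look like:

    // One int of dynamic shared memory per thread, assuming N is a
    // multiple of the block size.
    int threads = 256;
    int blocks  = N / threads;
    sum_kernel<<< blocks, threads, threads * sizeof(int) >>>(d_input, d_output);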

Thread Synchronization Function

    void __syncthreads();

- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared memory
- Should be used in conditional code only if the conditional is uniform across the entire thread block
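A hedged illustration of the uniformity rule (hypothetical kernel, not from the slides):

    __global__ void uniform_sync(float *a)
    {
        __shared__ float s[256];
        s[threadIdx.x] = a[blockIdx.x * blockDim.x + threadIdx.x];

        // OK: the condition is uniform across the block (it depends only on
        // blockIdx), so either every thread reaches the barrier or none does.
        if (blockIdx.x > 0)
            __syncthreads();

        // BAD: this condition depends on threadIdx, so only some threads in
        // the block would reach the barrier -- the block can deadlock.
        // if (threadIdx.x > 0) __syncthreads();

        a[blockIdx.x * blockDim.x + threadIdx.x] = s[threadIdx.x];
    }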

GPU Atomic Integer Operations
- Atomic operations on integers in global memory:
  - associative operations on signed/unsigned ints
  - add, sub, min, max, ...
  - and, or, xor
- Requires hardware with 1.1 compute capability
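For example, a minimal histogram sketch (hypothetical kernel, assuming non-negative input values): atomicAdd lets many threads safely increment the same global-memory counter.

    __global__ void histogram(int *d_bins, const int *d_data, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            atomicAdd(&d_bins[d_data[idx] % 256], 1);  // serialized per bin
    }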

Launching kernels on GPU
- Launch parameters:
  - grid dimensions (up to 2D)
  - thread-block dimensions (up to 3D)
  - shared memory: number of bytes per block
    - for extern __shared__ variables declared without size
    - optional, 0 by default
  - stream ID
    - optional, 0 by default

    dim3 grid(16, 16);
    dim3 block(16, 16);
    kernel<<<grid, block, 0, 0>>>(...);
    kernel<<<32, 512>>>(...);

GPU Memory Allocation / Release
- Host (CPU) manages GPU memory:
  - cudaMalloc(void **pointer, size_t nbytes)
  - cudaMemset(void *pointer, int value, size_t count)
  - cudaFree(void *pointer)

    int n = 1024;
    int nbytes = 1024 * sizeof(int);
    int *d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaFree(d_a);

Data Copies

    cudaMemcpy(void *dst, void *src, size_t nbytes,
               enum cudaMemcpyKind direction);

- returns after the copy is complete
- blocks the CPU thread
- doesn't start copying until previous CUDA calls complete
- enum cudaMemcpyKind:
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost
  - cudaMemcpyDeviceToDevice
- Non-blocking memcopies are provided

Host Synchronization
- All kernel launches are asynchronous
  - control returns to CPU immediately
  - kernel starts executing once all previous CUDA calls have completed
- Memcopies are synchronous
  - control returns to CPU once the copy is complete
  - copy starts once all previous CUDA calls have completed
- cudaThreadSynchronize()
  - blocks until all previous CUDA calls complete
- Asynchronous CUDA calls provide:
  - non-blocking memcopies
  - ability to overlap memcopies and kernel execution
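A hedged sketch of what the asynchrony buys you (hypothetical kernel and CPU work):

    kernel<<<grid, block>>>(d_a);  // returns to the CPU immediately
    do_independent_cpu_work();     // hypothetical: overlaps with the GPU
    cudaThreadSynchronize();       // wait until the kernel has finished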

Example: Host Code

    // allocate host memory
    unsigned int numBytes = N * sizeof(float);
    float* h_A = (float*) malloc(numBytes);

    // allocate device memory
    float* d_A = 0;
    cudaMalloc((void**)&d_A, numBytes);

    // copy data from host to device
    cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

    // execute the kernel
    increment_gpu<<< N/blockSize, blockSize >>>(d_A, b, N);

    // copy data from device back to host
    cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

    // free device memory
    cudaFree(d_A);

Device Management
- CPU can query and select GPU devices:
  - cudaGetDeviceCount(int* count)
  - cudaSetDevice(int device)
  - cudaGetDevice(int* current_device)
  - cudaGetDeviceProperties(cudaDeviceProp* prop, int device)
  - cudaChooseDevice(int* device, cudaDeviceProp* prop)
- Multi-GPU setup:
  - device 0 is used by default
  - one CPU thread can control one GPU
  - multiple CPU threads can control the same GPU; calls are serialized by the driver
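A minimal enumeration sketch built from the calls above:

    // List every device, then select device 0 (the default anyway).
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s (compute %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
    }
    cudaSetDevice(0);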

CUDA Error Reporting to CPU
- All CUDA calls return an error code:
  - except for kernel launches
  - cudaError_t type
- cudaError_t cudaGetLastError(void)
  - returns the code for the last error ("no error" has a code, too)
- char* cudaGetErrorString(cudaError_t code)
  - returns a null-terminated character string describing the error

    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
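Since kernel launches return no error code, a common pattern (a sketch, with a hypothetical kernel) is to query the last error right after the launch:

    kernel<<<grid, block>>>(d_a);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));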

CUDA Event API
- Events are inserted (recorded) into CUDA call streams
- Usage scenarios:
  - measure elapsed time for CUDA calls (clock cycle precision)
  - query the status of an asynchronous CUDA call
  - block CPU until CUDA calls prior to the event are completed
  - asyncAPI sample in CUDA SDK

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    kernel<<<grid, block>>>(...);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float et;
    cudaEventElapsedTime(&et, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);

Compiling CUDA

[Figure: compilation flow. NVCC splits a C/C++ CUDA application into CPU code and virtual PTX code; a PTX-to-target compiler then produces physical target code for a specific GPU such as the G80.]

PTX Example (SAXPY)

    cvt.u32.u16   $blockid, %ctaid.x;     // Calculate i from thread/block IDs
    cvt.u32.u16   $blocksize, %ntid.x;
    cvt.u32.u16   $tid, %tid.x;
    mad24.lo.u32  $i, $blockid, $blocksize, $tid;
    ld.param.u32  $n, [N];                // Nothing to do if n <= i
    setp.le.u32   $p1, $n, $i;
    @$p1 bra      $L_finish;

    mul.lo.u32    $offset, $i, 4;         // Load y[i]
    ld.param.u32  $yaddr, [Y];
    add.u32       $yaddr, $yaddr, $offset;
    ld.global.f32 $y_i, [$yaddr+0];
    ld.param.u32  $xaddr, [X];            // Load x[i]
    add.u32       $xaddr, $xaddr, $offset;
    ld.global.f32 $x_i, [$xaddr+0];

    ld.param.f32  $alpha, [ALPHA];        // Compute and store alpha*x[i] + y[i]
    mad.f32       $y_i, $alpha, $x_i, $y_i;
    st.global.f32 [$yaddr+0], $y_i;

    $L_finish:    exit;

Compilation
- Any source file containing CUDA language extensions must be compiled with nvcc
- NVCC is a compiler driver
  - works by invoking all the necessary tools and compilers, like cudacc, g++, cl, ...
- NVCC can output:
  - either C code (CPU code), which must be compiled with a C compiler
  - or PTX object code directly
- An executable with CUDA code requires:
  - the CUDA core library (cuda)
  - the CUDA runtime library (cudart), if the runtime API is used; the runtime loads the cuda library

CUDA Development Tools

GPU Tools
- Profiler
  - available now for all supported OSs
  - command-line or GUI
  - samples signals on the GPU for:
    - memory access parameters
    - execution (serialization, divergence)
- Debugger
  - runs on the GPU
- Emulation mode
  - compile and execute in emulation on CPU
  - allows CPU-style debugging in GPU source

GPU Architecture

Block Diagram (G80 Family)
- G80 (launched Nov 2006)
- 128 thread processors execute kernel threads
- Up to 12,288 parallel threads active

[Figure: host feeds an input assembler and thread execution manager, which dispatch to groups of thread processors, each group with its per-block shared memory (PBSM); load/store paths connect to global memory.]

Streaming Multiprocessor (SM)
- Processing elements
  - 8 scalar thread processors (SP)
  - 32 GFLOPS peak at 1.35 GHz
  - 8192 32-bit registers (32KB)
    - ½ MB total register file space!
  - usual ops: float, int, branch, ...
- Hardware multithreading
  - up to 8 blocks resident at once
  - up to 768 active threads in total
- 16KB on-chip shared memory
  - low-latency storage
  - shared among threads of a block
  - supports thread communication

[Figure: SM containing a multithreaded instruction unit (MT IU), SPs, and shared memory, with threads t0, t1, ..., tB resident.]

Hardware Multithreading
- Hardware allocates resources to blocks
  - blocks need: thread slots, registers, shared memory
  - blocks don't run until resources are available
- Hardware schedules threads
  - threads have their own registers
  - any thread not waiting for something can run
  - context switching is free: every cycle
- Hardware relies on threads to hide latency
  - i.e., parallelism is necessary for performance

SIMT Thread Execution
- Groups of 32 threads are formed into warps
  - always executing the same instruction
  - shared instruction fetch/dispatch
  - some become inactive when code paths diverge
  - hardware automatically handles divergence
- Warps are the primitive unit of scheduling
- SIMT execution is an implementation choice
  - sharing control logic leaves more space for ALUs
  - largely invisible to the programmer
  - must understand it for performance, not correctness
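A hedged sketch of divergence (hypothetical kernel, not from the slides):

    // Within one 32-thread warp, the even and odd lanes take different
    // paths, so the warp executes both branches serially with the
    // inactive lanes masked off. Correct, but half the lanes sit idle.
    __global__ void divergent(int *out)
    {
        int tid = threadIdx.x;
        if (tid % 2 == 0)
            out[tid] = 1;   // even lanes active here
        else
            out[tid] = 2;   // odd lanes active here
    }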

Blocks Run on Multiprocessors

[Figure: a kernel launched by the host is distributed across the device's processor array; each SM, with its MT IU, SPs, and shared memory, runs one or more blocks, all backed by device memory.]

Tesla T10
- 240 SP thread processors
- 30 DP thread processors
- Full scalar processor
- IEEE 754 double precision floating point

[Figure: thread processor array (TPA) with TP-array shared memory; each thread processor (TP) has a multi-banked register file and FP/Int units; a special function unit (SFU) provides SpcOps ALUs and double precision.]
