Parallel Programming In OpenCL - University Of Cambridge

Transcription

Parallel programming in OpenCL
Advanced Graphics & Image Processing
Rafał Mantiuk, Computer Laboratory, University of Cambridge

Single Program Multiple Data (SPMD)
Consider the following vector addition example.

Serial program: one program completes the entire task.
    for( i = 0:11 ) { C[i] = A[i] + B[i] }

SPMD program: multiple copies of the same program run on different chunks of the data, executing in parallel.
    for( i = 0:3 )  { C[i] = A[i] + B[i] }
    for( i = 4:7 )  { C[i] = A[i] + B[i] }
    for( i = 8:11 ) { C[i] = A[i] + B[i] }

From: OpenCL 1.2 University Kit

Parallel Software – SPMD
In the vector addition example, each chunk of data could be executed as an independent thread.
On modern CPUs, the overhead of creating threads is so high that the chunks need to be large. In practice, there are usually a few threads (about as many as the number of CPU cores) and each is given a large amount of work to do.
For GPU programming, there is low overhead for thread creation, so we can create one thread per loop iteration.

From: OpenCL 1.2 University Kit

Parallel Software – SPMD loop iteration

Single-threaded (CPU):
    // there are N elements
    for(i = 0; i < N; i++)
        C[i] = A[i] + B[i]

Multi-threaded (CPU):
    // tid is the thread id
    // P is the number of cores
    for(i = tid*N/P; i < (tid+1)*N/P; i++)
        C[i] = A[i] + B[i]

Massively Multi-threaded (GPU):
    // tid is the thread id
    C[tid] = A[tid] + B[tid]

From: OpenCL 1.2 University Kit

Parallel programming frameworks
These are some of the more relevant frameworks for creating parallelized code, targeting GPUs and/or CPUs: OpenCL, CUDA, Metal, OpenMP and OpenACC.

OpenCL
OpenCL is a framework for writing parallelized code for CPUs, GPUs, DSPs, FPGAs and other processors.
Initially developed by Apple, now supported by AMD, IBM, Qualcomm, Intel and Nvidia (reluctantly).
Versions:
Latest: OpenCL 2.2 – OpenCL C++ kernel language; SPIR-V as intermediate representation for kernels (Vulkan uses the same Standard Portable Intermediate Representation); supported by AMD and Intel.
Mostly supported: OpenCL 1.2 – Nvidia, OSX.

OpenCL platforms and drivers
To run OpenCL code you need:
Generic ICD loader – included in the OS.
Installable Client Driver – from Nvidia, Intel, etc.
This applies to Windows and Linux; there is only one platform on Mac.
To develop OpenCL code you need:
OpenCL headers/libraries – included in the SDKs (Nvidia CUDA Toolkit, Intel OpenCL SDK), but lightweight options are also available.

Programming OpenCL
OpenCL natively offers a C99 API, but there is also a standard OpenCL C++ API wrapper, which is strongly recommended as it reduces the amount of code.
Programming OpenCL is similar to programming shaders in OpenGL:
Host code runs on the CPU and invokes kernels.
Kernels are written in a C-like programming language, in many respects similar to GLSL.
Kernels are passed to the API as strings and compiled at runtime; kernels are usually stored in text files.
Kernels can be precompiled into SPIR from OpenCL 2.1.
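As an illustration of a kernel held as a string, here is a minimal sketch (not from the slides) of a vector-addition kernel embedded in the host source with the C++ wrapper; the kernel name vector_add is hypothetical:

    #include <string>

    // OpenCL C kernel source kept as a C++ raw string literal; it is
    // compiled at runtime by the OpenCL driver (it could equally be
    // loaded from a text file).
    const std::string kernelSource = R"CLC(
    __kernel void vector_add(__global const int* A,
                             __global const int* B,
                             __global int* C)
    {
        int i = get_global_id(0);   // one work-item per element
        C[i] = A[i] + B[i];
    }
    )CLC";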

Example: Step 1 - Select device
Get all Platforms → Select Platform → Get all Devices → Select Device
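A sketch of step 1 using the C++ wrapper; the exact code on the slide is not in the transcription, and the header name and the choice of the first platform and first GPU device are assumptions:

    #include <CL/cl.hpp>   // or CL/cl2.hpp, depending on the SDK
    #include <vector>
    #include <stdexcept>

    // Step 1: get all platforms, select one, get all its devices, select one.
    cl::Device selectDevice()
    {
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);                    // get all platforms
        if (platforms.empty())
            throw std::runtime_error("No OpenCL platforms found");

        std::vector<cl::Device> devices;
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);  // get all devices
        if (devices.empty())
            throw std::runtime_error("No GPU devices on this platform");

        return devices[0];                                // select a device
    }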

Example: Step 2 - Build program
Create Context → Load sources (usually from files) → Create Program → Build Program
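A sketch of step 2, again with the C++ wrapper (the file-loading helper and the function signature are assumptions):

    #include <fstream>
    #include <sstream>

    // Step 2: create a context for the chosen device(s), load the kernel
    // source from a text file and build the program.
    cl::Program buildProgram(const std::vector<cl::Device>& devices,
                             const std::string& sourcePath,
                             cl::Context& contextOut)
    {
        contextOut = cl::Context(devices);              // create context

        std::ifstream file(sourcePath);                 // load sources (usually from files)
        std::stringstream ss;
        ss << file.rdbuf();

        cl::Program program(contextOut, ss.str());      // create program from source string
        program.build(devices);                         // build (compile) for the devices
        return program;
    }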

Example: Step 3 - Create Buffers and copy memory
Create Buffers → Create Queue → Enqueue Memory Copy
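A sketch of step 3; N, A and B stand for the host-side data of the vector-addition example and are assumptions here:

    // Step 3: create device buffers, a command queue, and copy the inputs
    // to the device.
    cl::CommandQueue createBuffersAndCopy(const cl::Context& context,
                                          const cl::Device& device,
                                          const int* A, const int* B, int N,
                                          cl::Buffer& bufA, cl::Buffer& bufB,
                                          cl::Buffer& bufC)
    {
        bufA = cl::Buffer(context, CL_MEM_READ_ONLY,  sizeof(int) * N);
        bufB = cl::Buffer(context, CL_MEM_READ_ONLY,  sizeof(int) * N);
        bufC = cl::Buffer(context, CL_MEM_WRITE_ONLY, sizeof(int) * N);

        cl::CommandQueue queue(context, device);        // create the command queue

        // Blocking writes (CL_TRUE): each call returns once the copy has finished.
        queue.enqueueWriteBuffer(bufA, CL_TRUE, 0, sizeof(int) * N, A);
        queue.enqueueWriteBuffer(bufB, CL_TRUE, 0, sizeof(int) * N, B);
        return queue;
    }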

Example: Step 4 - Execute Kernel and retrieve the results
Create Kernel → Set Kernel Arguments → Enqueue Kernel → Enqueue Memory Copy (to read the results back)
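A sketch of step 4, continuing the same hypothetical vector_add kernel:

    // Step 4: create the kernel, set its arguments, enqueue it, and read the
    // result back with an enqueued memory copy.
    void runKernel(const cl::Program& program, cl::CommandQueue& queue,
                   cl::Buffer& bufA, cl::Buffer& bufB, cl::Buffer& bufC,
                   int* C, int N)
    {
        cl::Kernel kernel(program, "vector_add");       // create kernel
        kernel.setArg(0, bufA);                         // set kernel arguments
        kernel.setArg(1, bufB);
        kernel.setArg(2, bufC);

        // One work-item per element; the work-group size is left to the runtime.
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange);

        // Enqueue memory copy to retrieve the results (blocking read).
        queue.enqueueReadBuffer(bufC, CL_TRUE, 0, sizeof(int) * N, C);
    }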

OpenCL API Class Diagram
Platform – e.g. Nvidia CUDA
Device – e.g. GeForce 780
Program – collection of kernels
Buffer / Image – device memory
Sampler – how to interpolate values for an Image
Command Queue – put a sequence of operations there
Event – to notify that something has been done
From: OpenCL API 1.2 Reference Card

Platform model
The host is whatever the OpenCL library runs on – usually x86 CPUs for both NVIDIA and AMD.
Devices are processors that the library can talk to – CPUs, GPUs, DSPs and generic accelerators.
For AMD:
All CPUs are combined into a single device (each core is a compute unit and processing element).
Each GPU is a separate device.

Execution model
Each kernel executes on a 1D, 2D or 3D array (NDRange).
The array is split into work-groups.
Work items (threads) in each work-group share some local memory.
The kernel can query:
get_global_id(dim)
get_group_id(dim)
get_local_id(dim)
Work items are not bound to any memory entity (unlike GLSL shaders).
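To make the relationship concrete, here is a small sketch (not from the slides; sizes are illustrative) of a kernel querying its position in the NDRange, and the host call that chooses the global and work-group sizes:

    // Kernel side: every work-item can ask where it is in the NDRange.
    const char* idKernelSource = R"CLC(
    __kernel void show_ids(__global int* out)
    {
        int gx = get_global_id(0);   // position in the whole NDRange
        int lx = get_local_id(0);    // position within the work-group
        int wx = get_group_id(0);    // which work-group this item belongs to
        out[gx] = wx * 1000 + lx;
    }
    )CLC";

    // Host side (queue and kernel assumed to be set up as in the example steps):
    // a 1D NDRange of 1024 work-items split into work-groups of 64.
    //   queue.enqueueNDRangeKernel(kernel, cl::NullRange,
    //                              cl::NDRange(1024),   // global size
    //                              cl::NDRange(64));    // work-group (local) size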

Memory model
Host memory – usually CPU memory; the device does not have access to that memory.
Global memory [__global] – device memory, for storing large data.
Constant memory [__constant]
Local memory [__local] – fast, accessible to all work-items (threads) within a workgroup.
Private memory [__private] – accessible to a single work-item (thread).

Memory objects (cl::Buffer, cl::Image2D)
Buffer – like an ArrayBuffer in OpenGL; accessed directly via C pointers.
Image – like a Texture in OpenGL; accessed via texture look-up functions; can interpolate values, clamp, etc.
This diagram is incomplete – there are more memory objects.
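A sketch of creating the two kinds of memory objects and a sampler with the C++ wrapper (the image format, size and sampler settings are illustrative assumptions):

    void createMemoryObjects(const cl::Context& context)
    {
        // Buffer: raw device memory, addressed directly via pointers in the kernel.
        cl::Buffer buf(context, CL_MEM_READ_WRITE, sizeof(float) * 1024);

        // Image2D: like an OpenGL texture; read in kernels through a sampler
        // (e.g. with read_imagef), which can interpolate, clamp, etc.
        size_t width = 640, height = 480;               // illustrative size
        cl::Image2D img(context, CL_MEM_READ_ONLY,
                        cl::ImageFormat(CL_R, CL_FLOAT),
                        width, height);

        // Sampler: how the image is addressed and interpolated.
        cl::Sampler sampler(context,
                            CL_FALSE,                   // unnormalized coordinates
                            CL_ADDRESS_CLAMP_TO_EDGE,   // clamp outside the image
                            CL_FILTER_LINEAR);          // bilinear interpolation
    }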

Programming model
Data-parallel programming: each NDRange element is assigned to a work-item (thread).
Task-parallel programming: multiple different kernels can be executed in parallel; each kernel can use the vector types of the device (float4, etc.).
Command queue:
    queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int)*10, A);
CL_TRUE – blocking (in-order) call; CL_FALSE – non-blocking, may execute out-of-order.
The command queue provides means to both synchronize kernels and execute them in parallel.
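As an illustration of task parallelism and synchronization on the command queue, here is a sketch (not from the slides) of an out-of-order queue running two hypothetical kernels concurrently and synchronizing a third one on their events:

    void runTaskParallel(const cl::Context& context, const cl::Device& device,
                         cl::Kernel& kernelA, cl::Kernel& kernelB, cl::Kernel& kernelC)
    {
        // Out-of-order queue: commands may run concurrently unless ordered by events.
        cl::CommandQueue ooq(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);

        cl::Event evA, evB;
        ooq.enqueueNDRangeKernel(kernelA, cl::NullRange, cl::NDRange(1024),
                                 cl::NullRange, NULL, &evA);
        ooq.enqueueNDRangeKernel(kernelB, cl::NullRange, cl::NDRange(1024),
                                 cl::NullRange, NULL, &evB);

        // A third kernel that depends on both must wait on their events.
        std::vector<cl::Event> deps = { evA, evB };
        ooq.enqueueNDRangeKernel(kernelC, cl::NullRange, cl::NDRange(1024),
                                 cl::NullRange, &deps, NULL);
        ooq.finish();                                   // wait for everything to complete
    }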

Big Picture

Thread Mapping
By using different mappings, the same thread can be assigned to access different data elements. The examples below show three different possible mappings of threads to data (assuming the thread id is used to access an element).

    int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);

    int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);

    int group_size = get_local_size(0) * get_local_size(1);
    int tid = get_group_id(1) * get_num_groups(0) * group_size +
              get_group_id(0) * group_size +
              get_local_id(1) * get_local_size(0) + get_local_id(0);

[Figure: the thread IDs produced by each mapping, assuming 2x2 work-groups.]

From: OpenCL 1.2 University Kit

Thread Mapping
Consider a serial matrix multiplication algorithm. This algorithm is suited for output data decomposition:
We will create N x M threads, effectively removing the outer two loops.
Each thread will perform P calculations; the inner loop will remain as part of the kernel.
Should the index space be MxN or NxM? (A sketch of the serial algorithm follows.)

From: OpenCL 1.2 University Kit
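The serial code itself is not captured in the transcription; a standard row-major sketch of C = A x B (A is N x P, B is P x M), with loop names matching the i3 iterator mentioned later, might look like this:

    // Serial matrix multiplication: C (N x M) = A (N x P) * B (P x M), row-major.
    // The two outer loops become the NDRange; the inner loop over i3 stays in the kernel.
    void matmul_serial(const float* A, const float* B, float* C, int N, int M, int P)
    {
        for (int i1 = 0; i1 < N; i1++) {          // rows of C
            for (int i2 = 0; i2 < M; i2++) {      // columns of C
                float sum = 0.0f;
                for (int i3 = 0; i3 < P; i3++)    // dot product over the shared dimension
                    sum += A[i1 * P + i3] * B[i3 * M + i2];
                C[i1 * M + i2] = sum;
            }
        }
    }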

Thread Mapping
Thread mapping 1: with an MxN index space, the kernel maps thread IDs onto C like this (IDs run down the columns):
     0  4  8 12
     1  5  9 13
     2  6 10 14
     3  7 11 15
Thread mapping 2: with an NxM index space, the kernel maps thread IDs onto C like this (IDs run along the rows):
     0  1  2  3
     4  5  6  7
     8  9 10 11
    12 13 14 15
Both mappings produce functionally equivalent versions of the program.

From: OpenCL 1.2 University Kit

Thread Mapping
This figure shows the execution of the two thread mappings on NVIDIA GeForce 285 and 8800 GPUs. Notice that mapping 2 is far superior in performance for both GPUs.

From: OpenCL 1.2 University Kit

Thread Mapping
The discrepancy in execution times between the mappings is due to data accesses on the global memory bus.
Assuming row-major data, data in a row (i.e., elements in adjacent columns) are stored sequentially in memory.
To ensure coalesced accesses, consecutive threads in the same wavefront should be mapped to columns (the second dimension) of the matrices. This will give coalesced accesses in matrices B and C. For matrix A, the iterator i3 determines the access pattern for row-major data, so thread mapping does not affect it.

From: OpenCL 1.2 University Kit

Reduction
The GPU offers very good performance for tasks in which the results are stored independently: process N data items and store them in N memory locations.

    float reduce_sum(float* input, int length)
    {
        float accumulator = input[0];
        for(int i = 1; i < length; i++)
            accumulator += input[i];
        return accumulator;
    }

But many common operations require reducing N values into 1 or a few values: sum, min, max, prod, histogram, etc. Those operations require an efficient implementation of reduction.
The following slides are based on AMD's OpenCL Optimization Case Study: Simple Reductions.

Reduction tree for the min operation

    __kernel
    void reduce_min(__global float* buffer,
                    __local float* scratch,
                    const int length,
                    __global float* result)
    {
        int global_index = get_global_id(0);
        int local_index = get_local_id(0);
        // Load data into local memory
        if (global_index < length) {
            scratch[local_index] = buffer[global_index];
        } else {
            scratch[local_index] = INFINITY;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        for(int offset = get_local_size(0) / 2;
            offset > 0; offset >>= 1) {
            if (local_index < offset) {
                float other = scratch[local_index + offset];
                float mine = scratch[local_index];
                scratch[local_index] = (mine < other) ? mine : other;
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (local_index == 0) {
            result[get_group_id(0)] = scratch[0];
        }
    }

barrier ensures that all threads (work units) in the local group reach that point before execution continues.
Each iteration of the for loop computes the next level of the reduction pyramid.
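On the host side, the __local scratch argument is passed by size only. A sketch of launching this kernel with the C++ wrapper (the names and the choice of sizes are assumptions):

    // Launch reduce_min: one scratch float per work-item in each work-group.
    // Each work-group writes one partial minimum into partialResults.
    void launchReduceMin(cl::CommandQueue& queue, cl::Kernel& reduceMin,
                         cl::Buffer& input, cl::Buffer& partialResults,
                         int length, size_t globalSize, size_t localSize)
    {
        reduceMin.setArg(0, input);
        reduceMin.setArg(1, cl::Local(localSize * sizeof(float)));  // __local scratch
        reduceMin.setArg(2, length);
        reduceMin.setArg(3, partialResults);

        queue.enqueueNDRangeKernel(reduceMin, cl::NullRange,
                                   cl::NDRange(globalSize), cl::NDRange(localSize));
    }

The partial results can then be reduced again (the multistage reduction on the next slide) or copied back and finished on the CPU.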

Multistage reduction
The local memory is usually limited (e.g. 50kB), which restricts the maximum size of the array that can be processed. Therefore, large arrays need to be processed in multiple stages: the result of a local memory reduction is stored in an array, and then this array is reduced again.

Two-stage reduction
First stage: serial reduction by N concurrent threads; the number of threads is smaller than the number of data items.
Second stage: parallel reduction in local memory.

    __kernel
    void reduce(__global float* buffer,
                __local float* scratch,
                const int length,
                __global float* result)
    {
        int global_index = get_global_id(0);
        float accumulator = INFINITY;
        // Loop sequentially over chunks of the input vector
        while (global_index < length) {
            float element = buffer[global_index];
            accumulator = (accumulator < element) ? accumulator : element;
            global_index += get_global_size(0);
        }
        // Perform parallel reduction
        [The same code as in the previous example]
    }

Reduction performance CPU/GPU
A different reduction algorithm may be optimal for the CPU and the GPU; this can also vary from one GPU to another.
The results are from AMD's OpenCL Optimization Case Study: Simple Reductions.

Better way?
Halide – a language for image processing and computational photography: http://halide-lang.org/
Code is written in a high-level language, then translated to x86/SSE, ARM, CUDA or OpenCL.
The optimization strategy is defined separately as a schedule.
Auto-tune software can test thousands of schedules and choose the one that is the best for a particular platform, automatically finding the best trade-offs for that platform.
Designed for image processing, but similar languages have been created for other purposes.

OpenCL resources
https://www.khronos.org/registry/OpenCL/
Reference cards – Google: "OpenCL API Reference Card"
AMD OpenCL Programming Guide – AMD Accelerated Parallel Processing OCL Programming Guide-2013-06-21.pdf
