GPU Computing - CUDA

Transcription

GPU Computing - CUDA: a short overview of hardware and programming model
Pierre Kestener, CEA Saclay, DSM, Maison de la Simulation
INFIERI 2014, Univ. Paris-Diderot, Paris, July 24, 2014

Content
- Short historical view: from graphics processor to GPU accelerator
- Main differences between CPU and GPU
- CUDA hardware: differences with CPU
- CUDA software abstraction / programming model
- SIMT - Single Instruction Multiple Thread
- Memory hierarchy
reference: Axel Koehler, NVIDIA

Summary
1. Historical perspective
2. CUDA Hardware / Software
3. CUDA Code walkthrough

GPU evolution: before 2006, i.e. before CUDA
- GPU: dedicated hardware for the graphics pipeline
- GPU main function: off-load graphics tasks from the CPU to the GPU
- GPU: dedicated hardware for specialized tasks
- "All processors aspire to be general-purpose." - Tim Van Hook, Graphics Hardware 2001
- 2000s: shaders (programmable functionality in the graphics pipeline): low-level vendor-dependent assembly, then high-level Cg, HLSL, etc.
- Legacy GPGPU (before CUDA, 2004): premises of GPU computing
reference: The Evolution of GPUs for General Purpose Computing, by Ian Buck, GTC 2010

Floating-point computation capabilities in GPUs?
Floating-point computation capability implemented in GPU hardware:
- IEEE 754 standard written in the mid-80s
- Intel 80387: first IEEE 754-compatible floating-point coprocessor
- Value = (-1)^S * M * 2^E; denormalized numbers, infinities, NaN; rounding algorithms quite complex to handle/implement
- FP16 in 2000
- FP32 in 2003-2004: simplified IEEE 754 standard (floating-point rounding is complex and costly in terms of transistor count)
- CUDA, 2007: rounding fully implemented for + and *, denormalized numbers not completely implemented
- CUDA Fermi, 2010: the 4 mandatory IEEE rounding modes; subnormals at full speed (Nvidia GF100)
link: http://homepages.dcc.ufmg.br/~sylvain.collange/talks/raim11_scollange.pdf

GPU computing - CUDA hardware - 2006
CUDA: Compute Unified Device Architecture
- Nvidia GeForce 8800 (2006) introduced a unified architecture (only one type of shader processor)
- first generation with hardware features designed with GPGPU in mind: almost full support of the IEEE 754 standard for single-precision floating point, random read/write in external RAM, memory cache controlled by software
- CUDA = new hardware architecture + new programming model / software abstraction (a C-like programming language + development tools: compiler, SDK, libraries like cuFFT)

Moore's law - the free lunch is over
The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years.

Moore's law - the free lunch is over
Moore's law continues with technology scaling (32 nm in 2010, 22 nm in 2011):
- improving transistor performance to increase frequency,
- increasing transistor integration capacity to realize complex architectures,
- reducing energy consumed per logic operation to keep power dissipation within limits.
Shekhar Borkar, Thousand Core Chips - A Technology Perspective, Intel Corp, Microprocessor Technology Lab, 2007, p. 1-4

Moore's law - towards multi-core architectures
Pollack's rule - wide adoption of multi-core architectures:
- if you double the logic in a processor core, it delivers only 40% more performance
- a multi-core microarchitecture has the potential to provide near-linear performance improvement with complexity and power
- for example, two smaller processor cores, instead of a large monolithic processor core, can potentially provide 70-80% more performance, compared to only 40% from a large monolithic core
Shekhar Borkar, Thousand Core Chips - A Technology Perspective, Intel Corp, Microprocessor Technology Lab, 2007

Moore's law - towards multi-core architectures
More transistors = more computing power! More transistors? For what purpose? How to use them efficiently?
Improve single-core CPU performance:
- keep increasing frequency (watch electric power!)
- keep increasing transistor density (more and more difficult): 32 nm in 2010
Utilize the transistors on chip efficiently:
- instruction-level parallelism (out-of-order execution, etc.)
- data-level parallelism (SIMD, vector units): SSE, Cell SPE, GPU!
- thread-level parallelism: hardware multi-threading, multi-core, many-core
http://www.ugrad.cs.ubc.ca/~cs448b/2010-1/lecture/2010-09-09-ugrad.pdf

What is a supercomputer?
- "A computing system exhibiting high-end performance capabilities and resource capacities within practical constraints of technology, cost, power, and reliability." Thomas Sterling, 2007.
- "A large very fast mainframe used especially for scientific computations." Merriam-Webster Online.
- "Any of a class of extremely powerful computers. The term is commonly applied to the fastest high-performance systems available at any given time. Such computers are used primarily for scientific and engineering work requiring exceedingly high-speed computations." Encyclopedia Britannica Online.

Supercomputer architectures - TOP500
A supercomputer is designed to be at the bleeding edge of current technology.
Leading technology paths (to exascale), using TOP500 ranks (Nov. 2013):
- Multicore: maintain complex cores, and replicate (x86, SPARC, Power7) (#4 and #10)
- Manycore/embedded: use many simpler, low-power cores from embedded computing (IBM BlueGene) (#3, 5, 8 and 9)
- GPU/MIC/accelerator: use highly specialized processors from the gaming/graphics market (NVidia Fermi, Cell, Intel Phi (MIC)) (#1, 2, 6 and 7)

Supercomputer architectures
Multiple levels of hierarchy:
- need to aggregate the computing power of several tens of thousands of nodes!
- network efficiency: latency, bandwidth, topology
- memory: on-chip (cache), off-chip (DRAM), IO (disk)
- emerging hybrid programming model: MPI + multi-thread
Figure: multi-core node summary
source: multicore tutorial (SC12) by Georg Hager and Gerhard Wellein

What is a supercomputer?
Figure: Horst Simon, LBNL

Summary
1. Historical perspective
2. CUDA Hardware / Software
3. CUDA Code walkthrough

From multi-core CPU to manycore GPU
Architecture design differences between manycore GPUs and general-purpose multicore CPUs?
Different goals produce different designs:
- the CPU must be good at everything, parallel or not
- the GPU assumes the workload is highly parallel

From multi-core CPU to manycore GPU
Architecture design differences between manycore GPUs and general-purpose multicore CPUs?
CPU design goal: optimize the architecture for sequential code performance, i.e. minimize the latency experienced by one thread:
- sophisticated (i.e. large chip area) control logic for instruction-level parallelism (branch prediction, out-of-order execution, etc.)
- CPUs have large cache memories to reduce instruction and data access latency

From multi-core CPU to manycore GPU
Architecture design differences between manycore GPUs and general-purpose multicore CPUs?
GPU design goal: maximize the throughput of all threads:
- the number of threads in flight is limited by resources, so provide lots of resources (registers, bandwidth, etc.)
- multithreading can hide latency, so skip the big caches
- share control logic across many threads

From multi-core CPU to manycore GPU
Architecture design differences between manycore GPUs and general-purpose multicore CPUs?
The GPU takes advantage of a large number of execution threads to find work to do when other threads are waiting for long-latency memory accesses, thus minimizing the control logic required for each execution thread.

Nvidia Fermi hardware (2010)
- Streaming Multiprocessor (32 cores), hardware control, queuing system
- GPU = scalable array of SMs (up to 16 on Fermi)
- warp: vector of 32 threads, executes the same instruction in lock-step
- throughput limiters: finite limit on warp count, on register file, on shared memory, etc.

Nvidia Fermi hardware (2010)
CUDA hardware (HW) key concepts:
- hardware thread management: HW thread launch and monitoring, HW thread switching, up to 10,000s of lightweight threads
- SIMT execution model
- multiple memory scopes: per-thread private memory (registers), per-thread-block shared memory, global memory
- using threads to hide memory latency
- coarse-grained thread synchronization

CUDA - connecting program and execution model
- need a programming model to use such hardware efficiently; it must also provide scalability
- provide a simple way of partitioning a computation into fixed-size blocks of threads

CUDA - connecting program and execution model
- the total number of threads must be quite a bit larger than the number of cores
- thread block: logical array of threads, in large numbers to hide latency
- thread block size: controlled by the program, specified at runtime; it had better be a multiple of the warp size (i.e. 32)

CUDA - connecting program and execution model
- must give the GPU enough work to do!: if there are not enough thread blocks, some SMs will remain idle
- thread grid: logical array of thread blocks, distributes work among the SMs, several blocks per SM
- thread grid size: chosen by the program at runtime; can be the total number of threads / thread block size, or a multiple of the number of SMs (see the sizing sketch below)
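A minimal sketch (not from the original slides; N and the 256-thread block size are illustrative choices) of the usual way to size the grid so that N elements are covered even when N is not a multiple of the block size:

int N = 1000000;                               // problem size (assumption)
dim3 block(256, 1, 1);                         // a multiple of the warp size (32)
dim3 grid((N + block.x - 1) / block.x, 1, 1);  // ceiling division: enough blocks to cover N
// each thread must still compare its global index against N inside the kernel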

CUDA: heterogeneous programming
- heterogeneous systems: the CPU and the GPU have separate memory spaces (host and device)
- CPU code and GPU code can be in the same program / file (a pre-processing tool performs the separation)
- the programmer focuses on code parallelization (at the algorithm level), not on how to schedule blocks of threads on the multiprocessors.

CUDA: heterogeneous programming
Current GPU execution flow (figure: copy input data from CPU memory to GPU memory, load the GPU program and execute it, then copy the results back from GPU memory to CPU memory)
reference: Introduction to CUDA/C, GTC 2012

CUDA: programming model (PTX)
- a block of threads is a CTA (Cooperative Thread Array)
- threads are indexed inside a block; use that index to map threads to memory
- write a program once for one thread, then run this program on multiple threads
- a block is a logical array of threads, indexed with threadIdx (built-in variable)
- a grid is a logical array of blocks, indexed with blockIdx (built-in variable)

Summary
1. Historical perspective
2. CUDA Hardware / Software
3. CUDA Code walkthrough

CUDA C/C++
- a large subset of the C/C++ language
- CPU and GPU code in the same file; a preprocessor filters GPU-specific code from CPU code
- a small set of extensions to enable heterogeneous programming: new keywords
- a runtime/driver API:
  - memory management: cudaMalloc, cudaFree, ...
  - device management: cudaChooseDevice, probe device properties (number of SMs, amount of memory, ...)
  - event management: profiling, timing, ... (see the timing sketch below)
  - stream management: overlapping CPU-GPU memory transfers with computations, ...
Terminology:
- host: the CPU and its memory
- device: the GPU and its memory
- kernel: a routine executed on the GPU
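As an illustration of the event-management part of the runtime API, a minimal timing sketch (myKernel, grid and block are hypothetical placeholders; not from the original slides):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                // record event in the default stream
myKernel<<<grid, block>>>(/* args */);    // hypothetical kernel launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait until the stop event is reached

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);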

CUDA: C-language extensions and run-time API
Function and variable type qualifiers:

__global__ void KernelFunc(...);   // kernel callable from host
__device__ float DeviceFunc(...);  // function callable on device
__device__ int GlobalVar;          // variable in device memory
__shared__ int SharedVar;          // shared in PDC by thread block
__host__ void HostFunc(...);       // function callable on host

- built-in variables: threadIdx and blockDim, blockIdx and gridDim (read-only registers)
- kernel function launch syntax:
  KernelFunc<<<500, 128>>>(...); // launch 500 blocks w/ 128 threads each
  <<< ... >>> is used to set grid and block sizes (can also set shared mem size per block)
- thread synchronization inside a block:
  __syncthreads(); // barrier synchronization within kernel
- libc-like routines (e.g. memory allocation, CPU/GPU data transfers, ...)

CUDA Code walkthrough
Data parallel model: use the intrinsic variables threadIdx and blockIdx to create a mapping between threads and actual data.

CUDA Code walkthrough
Data parallel model: use the intrinsic variables threadIdx and blockIdx to create a mapping between threads and actual data.

/*
 * nvcc -m64 -gencode arch=compute_20,code=sm_20 --ptxas-options=-v
 *      -o scalarAdd scalarAdd.cu
 */
#include <stdio.h>

/**
 * a simple CUDA kernel
 *
 * \param[inout] a input integer pointer
 * \param[in]    b input integer
 * \param[in]    n input array size
 */
__global__ void add( int *a, int b, int n ) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n)
    a[idx] = a[idx] + b;
}

CUDA Code walkthrough
Data parallel model (continued): the host-side main.

/*
 * main
 */
int main( void ) {
  // array size
  int N = 16;

  // host variables
  int *a; int b;

  // device variables
  int *dev_a;

  // CPU memory allocation
  a = (int *) malloc(N*sizeof(int));
  b = N;

  // CPU memory initialization
  for (int i = 0; i < N; i++) a[i] = i;

  // GPU device memory allocation
  cudaMalloc( (void**)&dev_a, N*sizeof(int) );

  // GPU device memory initialization
  cudaMemcpy( dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice );

  // perform computation on GPU
  int nbThreads = 8;
  dim3 blockSize(nbThreads,1,1);
  dim3 gridSize((N+nbThreads-1)/nbThreads,1,1);
  add<<<gridSize,blockSize>>>( dev_a, b, N );

  // get back computation result into host CPU memory
  cudaMemcpy( a, dev_a, N*sizeof(int), cudaMemcpyDeviceToHost );

CUDA Code walkthrough
Data parallel model (continued): cleanup and exit.

  // do something!

  // de-allocate CPU host memory
  free(a);

  // de-allocate GPU device memory
  cudaFree( dev_a );

  cudaDeviceSynchronize();
  cudaDeviceReset();

  return 0;
}
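The walkthrough omits error checking; a common addition (a minimal sketch, not part of the original example) is to test the status of runtime calls and kernel launches:

// check the status of the last runtime call or kernel launch
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
  fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
}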

CUDA memory hierarchy: software/hardware
hardware (Fermi) memory hierarchy:
- on-chip memory: low latency, fine granularity, small amount
- off-chip memory: high latency, coarse granularity (coalescence constraints, ...), large amount
- shared memory: a kind of cache, controlled by the user, for data reuse inside a thread block
- it takes practice to understand how to optimize global memory bandwidth

CUDA memory hierarchy: software/hardware
software memory hierarchy (see the kernel sketch below):
- register: for variables private to a thread
- shared: for variables private to a thread block, public to all threads inside the block
- global: large input data buffers
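A minimal kernel sketch (not from the original slides; assumes blocks of 256 threads and an array length equal to gridDim.x * 256) showing how the three software memory levels appear in code:

__global__ void reverseInsideBlock(const int *d_in, int *d_out)
{
  __shared__ int tile[256];   // shared: private to the thread block
  int tid = threadIdx.x;      // local scalars live in registers
  int gid = blockIdx.x * blockDim.x + tid;

  tile[tid] = d_in[gid];      // each element read from global memory once
  __syncthreads();            // make the tile visible to the whole block

  // reuse data loaded by the other threads of the block, from shared memory
  d_out[gid] = tile[blockDim.x - 1 - tid];
}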

Performance tuning thoughts
Threads are free:
- keep threads short and balanced
- the HW can (must) use LOTS of threads (several thousand to tens of thousands) to hide memory latency
- HW thread launch: near-zero overhead to create a thread
- HW thread context switch: near-zero overhead scheduling
Barriers are cheap:
- a single instruction: __syncthreads();
- HW synchronization of thread blocks
Get data on the GPU, and leave it there as long as possible
Expose parallelism: give the GPU enough work to do
Focus on data reuse: avoid memory bandwidth limitations
ref: M. Shebanow, NVIDIA

Other subjective thoughts
- tremendous rate of change in hardware from CUDA 1.0 to 2.0 (Fermi) and 3.0/3.5 (Kepler)

  CUDA HW version        | Features
  1.0                    | basic CUDA execution model
  1.3                    | double precision, improved memory accesses, atomics
  2.0 (Fermi)            | caches (L1, L2), FMAD, 3D grids, ECC, P2P (unified address space), function pointers, recursion
  3.5 (Kepler GK110) (1) | dynamic parallelism, object linking, GPU Direct RemoteDMA, new instructions, read-only cache, Hyper-Q

- memory constraints like coalescence were very strong in CUDA HW 1.0: large performance drop when the memory access pattern was not coalescent
- obtaining functional CUDA code can be easy, but optimization might require good knowledge of the hardware (just to fully understand profiling information)
(1) as seen in the slides "CUDA 5 and Beyond" from GTC 2012

GPU computing - OpenCL
OpenCL = Open Computing Language
- standard: http://www.khronos.org, version 1.0 (12/2008)
- focus on portability: a programming model for GPUs (Nvidia/ATI), multicore CPUs and others: FPGAs (Altera/Xilinx)
- the OpenCL programming model uses most of the abstract concepts of CUDA: grid of blocks of threads, memory hierarchy, ...
Tutorial: http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/06-intro_to_opencl.pdf

GPU computing - OpenCL
The OpenCL language is based on C99 with restrictions.
Some difficulties:
- HW vendors must provide their own OpenCL implementation; AMD is leading with OpenCL 1.2
- multiple SDKs with vendor-specific addons break portability
- both CUDA and OpenCL provide rather low-level APIs, but OpenCL's learning curve is steeper at first: lots of low-level code to handle the different abstract concepts (platform, device, command queue, ...); need to probe the hardware, ...

GPU computing - OpenCL
The OpenCL language is based on C99 with restrictions.
Some difficulties:
- architecture-aware optimization breaks portability: NVIDIA and AMD/ATI hardware are different and require different optimization strategies

GPU computing - OpenCL
porting a CUDA program to OpenCL:
- reference: ...ages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx#four
- tools to automate CUDA-to-OpenCL conversion: SWAN, CU2CL (a CUDA to OpenCL source-to-source translator); not sure these tools are really used by a large set of users
- other tools: MCUDA (a CUDA to OpenMP source-to-source translator)

GPU computing - OpenCL
Quick / partial comparison of CUDA and OpenCL; see the GPU software blog by AccelerEyes:
- Performance: both can fully utilize the hardware, but results might be hardware dependent (across multiple CUDA HW versions), algorithm dependent, etc. Use the SHOC benchmark to get an idea.
- Portability: CUDA is NVIDIA-only (but there is the new LLVM toolchain, and Ocelot provides a way from PTX to other backend targets like x86 CPUs or AMD GPUs); OpenCL is an industry standard (runs on AMD, NVIDIA, Intel CPUs)
- Community: larger community for CUDA; HPC supercomputers with NVIDIA hardware; significantly larger number of existing applications in CUDA
- Third-party libraries: NVIDIA provides good starting points for FFT, BLAS or sparse linear algebra, which makes its toolkit appealing at first
- Other hardware: some embedded device applications use OpenCL (e.g. Android OS); CUDA will probably be used for Tegra apps; CUDA toolkit for ARM (project Mont-Blanc)

vector addition in OpenCL
Build as a regular CPU application:
  gcc -I$CUDA_ROOT/include -o vectorAdd vectorAdd.c -lOpenCL
- OpenCL kernels are C strings, or contained in files
- the libOpenCL implementation embeds the actual compiler
reference: ...opencl-vector-addition/

FPGA computing - OpenCL
FPGA: more flexibility in the design
references:
- ...12/10/opencl_ieee_cs_1012.pdf
- https://www.youtube.com/watch?v=dZEPjhwOOtk

Install CUDA / OpenCL
- You must have a GPU compatible with CUDA. Personal example: GeForce GTX 560 Ti (hardware 2.0)
- Comparison of Nvidia GPU features: http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
- example of an almost up-to-date system:
  OS: Ubuntu 12.10 - x86_64 (October 2013)
  CUDA version: 5.5.22
- Since 2013, the Nvidia driver + toolkit + sdk are available as Deb or RPM packages
Tips for Ubuntu 12.10:
- sudo dpkg -i cuda-repo-ubuntu1210_5.5-0_amd64.deb; this will set up the nvidia repository (writes the file /etc/apt/sources.list.d/cuda.list)
- then you can install the tools with: sudo apt-get install cuda

Install CUDA / OpenCL
Nvidia Linux kernel driver: version 319.37
- Deb package nvidia-current; builds the kernel module nvidia_current.ko
- installs the CUDA Driver API library libcuda.so, libnvidia-opencl.so, etc.
- misc: Nvidia codec libraries: nvcuvid, nvcuvenc
compiling toolchain (aka toolkit):
- provides the nvcc CUDA/C compiler
- CUDA RunTime API library libcudart.so
- profiling tool: nvvp (can be used inside nsight)
- IDE: a custom version of Eclipse named nsight
- scientific libraries: cuBLAS, cuFFT, cuRAND, cuSPARSE, etc.
- CUDA documentation

CUDA compatible GPU
- Is the Nvidia graphics driver loaded? Have a look at the proc filesystem: /proc/driver/nvidia
- Monitoring Nvidia GPUs: nvidia-smi
- What are the GPU features? Run the CUDA sample deviceQuery; header cuda_runtime_api.h: cudaGetDeviceCount, cudaGetDeviceProperties (see the sketch below)
- What is the installed driver version? cat /proc/driver/nvidia/version; nvidia-smi
- GeForce GTX 285, Tesla C1060
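A minimal deviceQuery-like sketch using the two runtime calls named above (not the actual SDK sample; the printed fields are a selection):

#include <stdio.h>
#include <cuda_runtime_api.h>

int main(void)
{
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("device %d: %s, compute capability %d.%d, %d SMs, %lu MB global memory\n",
           dev, prop.name, prop.major, prop.minor,
           prop.multiProcessorCount,
           (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
  }
  return 0;
}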

SDK CUDA
Most SDK samples follow roughly the same template (see the project template in the 0_Simple subdir):
- project_name.cu: contains main and the entry point to the GPU computation (a call to a CUDA kernel)
- project_name_kernel.cu: definition of some CUDA kernel
- project_name_gold.c: native CPU version, for comparison or performance benchmark
A few important examples, for pedagogical reasons:
- transpose: efficient use of device memory bandwidth, memory coalescence, shared memory, bank conflicts
- reduction: efficient use of device memory bandwidth
One can mix CUDA device code into a regular C/CPP source file, provided it is protected by the macro __CUDACC__

CUDA development tools
CUDA toolkit = nvcc compiler + runtime library
- any source code file with CUDA language extensions (.cu) needs to be compiled with nvcc
- NVCC is a compiler driver: it works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
- NVCC can output: PTX (Parallel Thread eXecution) code, or object code directly
An executable with CUDA code requires:
- the CUDA core library (cuda)
- the CUDA runtime library (cudart)

CUDA compiler: nvcc
CUDA toolkit = nvcc compiler + runtime library
Important flags:
- -arch=sm_13, sm_20, sm_35: enable double precision (on compatible hardware)
- -G: enable debug for device code
- --ptxas-options=-v: show register and memory usage
- --use_fast_math: use the fast math library (single precision only)
- --maxrregcount N: limit the number of registers
Example:

  nvcc -gencode arch=compute_20,code=sm_20 --ptxas-options=-v -o scalarAdd scalarAdd.cu
  ptxas info: Compiling entry function '_Z3addPiii' for 'sm_20'
  ptxas info: Function properties for _Z3addPiii
      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

CUDA profiling / performance analysis tools
- command-line profiler: export COMPUTE_PROFILE=1; export COMPUTE_PROFILE_CONFIG=some_config_file (the config file specifies which hardware performance counters you want reported)
- GUI profiler: nvvp
- new in CUDA 5.0: nvprof (can be used like gprof)
- other performance analysis tools: TAU, PAPI CUPTI (see the extra slides at the end)
references:
- slides: GPU Performance Analysis and Optimization, by P. Micikevicius (Nvidia)
- video (much better): ...o/nvid6/player

GPU: floating point computation support
Floating-point computation capability implemented in GPU hardware:
- IEEE 754 standard written in the mid-80s
- Intel 80387: first IEEE 754-compatible floating-point coprocessor
- Value = (-1)^S * M * 2^E; denormalized numbers, infinities, NaN; rounding algorithms quite complex to handle/implement
- FP16 in 2000
- FP32 in 2003-2004: simplified IEEE 754 standard (floating-point rounding is complex and costly in terms of transistor count)
- CUDA, 2007: rounding fully implemented for + and *, denormalized numbers not completely implemented
- CUDA Fermi, 2010: the 4 mandatory IEEE rounding modes; subnormals at full speed (Nvidia GF100)
link: http://homepages.dcc.ufmg.br/~sylvain.collange/talks/raim11_scollange.pdf

GPU: floating point computation support
Why are my floating-point results different on the GPU from the CPU?
- you cannot always expect the same results from different hardware
- one algorithm, but 2 different software implementations for 2 different pieces of hardware, CPU and GPU
Hardware differences:
- the CPU floating point unit (FPU) might use x87 instructions (with 80-bit precision used internally); SSE operations use 32- or 64-bit values
- compiler dependency: are values kept in registers? or spilled to external memory?
- Fused Multiply-Add, from IEEE 754-2008: a*b+c in a single rounding step, for better precision; implemented on all GPUs from CUDA hardware 2.0, but not on all CPUs (AMD/Bulldozer/2011, Intel/Haswell/2013, ...)
- use the nvcc flag -fmad=false to tell the compiler not to use FMAD instructions
- see the code snippet in the hands-on (and the sketch below)
CUDA doc: Floating Point on NVIDIA GPU White Paper.pdf
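A minimal sketch (not the hands-on snippet referenced above) making the FMA effect visible with the CUDA device intrinsics __fmaf_rn, __fmul_rn and __fadd_rn:

// fused vs. unfused multiply-add: one rounding step vs. two
__global__ void fmaDemo(float a, float b, float c, float *out)
{
  out[0] = __fmaf_rn(a, b, c);             // a*b+c with a single rounding (FMA)
  out[1] = __fadd_rn(__fmul_rn(a, b), c);  // a*b rounded, then +c rounded
}
// out[0] and out[1] can differ in the last bits; a CPU without FMA behaves
// like the second line, which is one source of CPU/GPU discrepancies.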

GPU: floating point computation support
Why are my floating-point results different on the GPU from the CPU?
- parallel computations rearrange operations, and floating-point addition is not associative ((a+b)+c != a+(b+c)), so reordering changes the results; see the reduction example (and the sketch below)
- Different? By how much? It can be very hard to distinguish differences coming from parallel computation from a genuine bug in the algorithm, when comparing a reference CPU algorithm with the GPU-ported algorithm ...
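A minimal host-side sketch (compiles with nvcc or gcc; not from the original slides) showing that the grouping of floating-point additions changes the result, which is what a parallel reduction does at a large scale:

#include <stdio.h>

int main(void)
{
  float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
  printf("(a+b)+c = %g\n", (a + b) + c);  // prints 1
  printf("a+(b+c) = %g\n", a + (b + c));  // prints 0: c is absorbed when added to b
  return 0;
}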
