GPU Acceleration of HFSS Transient - NVIDIA

Transcription

GPU Acceleration of HFSS Transient
Hsueh-Yung (Robert) Chao, Stylianos Dosopoulos, and Rickard Petersson
ANSYS, Inc.

Outline

- ANSYS overview
- HFSS transient solvers
- Why use graphic processing units (GPUs) for acceleration?
- Optimization of CUDA programs for HFSS Transient
- Distributed solve on multiple GPUs
- Conclusions

Our Vision

Our Strategy

HFSS Transient Solvers R13 (2011)

- General-purpose hybrid (implicit-explicit) hp-adaptive finite-element solver for transient electromagnetics
- Superior to finite-difference methods (FDTD, FIT) for solving multiscale problems
- Explicit part: discontinuous Galerkin time domain (DGTD) method with local time stepping
- Implicit part: dual-field Crank-Nicolson, locally implicit to alleviate the small-time-step restriction of explicit DG
- OpenMP multithreading

Benchmark shown: 121,764 tets, Intel X5675, 8 CPU cores, 893 MB DRAM, 3 hrs 14 mins

HFSS Transient Solvers R15 (2013)

- Target applications with electrically large structures and high-order meshes
- GPU-accelerated DGTD with local time stepping
- One process on one GPU
- Multiple processes on multiple GPUs for parametric sweeps and network analysis with multiple excitations

HFSS Transient Solvers R16 (2015)

- Target applications with electrically small structures and low-frequency signals
- Immune to bad meshes
- Single-field, fully implicit with Newmark-beta (β = 1/4)
- Similar memory scalability to frequency-domain HFSS
- OpenMP multithreading

*Touch screen and helicopter simulations courtesy of Jack Wu and Arien Sligar
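
For reference, the standard Newmark-beta update for a second-order system $M\ddot{u} + C\dot{u} + Ku = f$ (the textbook formulation; not necessarily the exact HFSS discretization) is

$$u_{n+1} = u_n + \Delta t\,\dot{u}_n + \Delta t^2\left[\left(\tfrac{1}{2}-\beta\right)\ddot{u}_n + \beta\,\ddot{u}_{n+1}\right],\qquad \dot{u}_{n+1} = \dot{u}_n + \Delta t\left[(1-\gamma)\ddot{u}_n + \gamma\,\ddot{u}_{n+1}\right].$$

With β = 1/4 and γ = 1/2 (constant average acceleration) the scheme is unconditionally stable, which is why it suits a fully implicit solver aimed at low-frequency signals.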

Why Use GPUs for Acceleration?

- The large number of ALUs in GPUs is especially favorable for massively parallel processing.
- DGTD is inherently highly parallel. Its locality property makes DGTD map efficiently to the GPU architecture model.
- A GPU provides much higher memory bandwidth and FLOPS than a CPU.

[Figure: FEM mesh mapped onto the CUDA model. Parallelism level 1: thread blocks (TB1, TB2); level 2: nodal bases; level 3: threads.]
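
As a rough illustration of this mapping (a minimal sketch with hypothetical names and sizes, not the actual HFSS kernel), a DGTD field-update kernel could assign one thread block to a small batch of elements and one thread to each nodal basis function:

```cuda
// Minimal sketch of the three-level mapping described above.
// NODES_PER_ELEM, ELEMS_PER_BLOCK, and updateFields are assumed names/values.
#include <cuda_runtime.h>

constexpr int NODES_PER_ELEM  = 20;  // nodal basis functions per element (assumed)
constexpr int ELEMS_PER_BLOCK = 8;   // elements handled by one thread block (assumed)

__global__ void updateFields(const double* coeffIn, double* coeffOut, int numElems)
{
    // Level 1: each thread block owns a small batch of elements.
    int elem = blockIdx.x * ELEMS_PER_BLOCK + threadIdx.y;
    // Levels 2-3: each thread owns one nodal basis function of one element.
    int node = threadIdx.x;
    if (elem >= numElems) return;

    int idx = elem * NODES_PER_ELEM + node;
    coeffOut[idx] = coeffIn[idx];    // placeholder for the actual DGTD update
}

void launchUpdate(const double* in, double* out, int numElems)
{
    dim3 block(NODES_PER_ELEM, ELEMS_PER_BLOCK);                    // 160 threads per block
    dim3 grid((numElems + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK);  // blocks cover all elements
    updateFields<<<grid, block>>>(in, out, numElems);
}
```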

Why Use GPUs for Acceleration? (Cont.)

[Figure: local time stepping (LTS). Li is level i of local time stepping; the time step at Li is 2^i Δt (Δt, 2Δt, 4Δt, ...). Three panels show the DGTD field-update load per time step for LTS levels L0-L4: the total load per level, the distribution over four CPU cores, and the distribution over GPU thread blocks.]

- Field updates across LTS levels are interdependent and cannot be parallelized.
- On CPUs, the load is not evenly distributed over the cores if the mesh is divided based on an equal number of elements; CPU cores are frequently idle at some LTS levels, so scalability can be poor.
- On the GPU, the load is evenly distributed to thread blocks for each LTS level; parallel efficiency is limited by the lowest LTS level with few elements.
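
One way to realize this on the GPU (a hedged sketch, not the actual HFSS scheduler) is to group elements by LTS level on the host and launch the field-update kernel once per level, with a grid sized to that level's element count so every thread block gets the same amount of work:

```cuda
// Sketch of launching the field update level-by-level so that each launch's
// thread blocks all do the same amount of work. Names (fieldUpdateKernel,
// levelOffsets, levelCounts) are hypothetical, not HFSS internals.
#include <cuda_runtime.h>
#include <vector>

__global__ void fieldUpdateKernel(double* coeffs, int firstElem)
{
    int elem = firstElem + blockIdx.x;   // one block per element of this LTS level
    int node = threadIdx.x;              // one thread per nodal basis function
    // ... DGTD update for (elem, node) would go here ...
    (void)coeffs; (void)elem; (void)node;
}

// Elements are assumed pre-sorted by LTS level; level i advances with step 2^i * dt,
// so level i is updated 2^(L-1-i) times per step of the coarsest level L-1.
void advanceOneCoarseStep(double* coeffs,
                          const std::vector<int>& levelOffsets,
                          const std::vector<int>& levelCounts,
                          int nodesPerElem)
{
    int numLevels = static_cast<int>(levelCounts.size());
    int fineStepsPerCoarse = 1 << (numLevels - 1);
    for (int sub = 0; sub < fineStepsPerCoarse; ++sub) {
        // Levels are processed in order because updates across levels are interdependent.
        for (int lvl = 0; lvl < numLevels; ++lvl) {
            if (sub % (1 << lvl) != 0) continue;   // level lvl steps every 2^lvl fine steps
            if (levelCounts[lvl] == 0) continue;
            fieldUpdateKernel<<<levelCounts[lvl], nodesPerElem>>>(coeffs, levelOffsets[lvl]);
        }
    }
}
```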

GPU Code Optimization

Key feature 1: CUDA is SIMT. Peak performance requires that all threads in a warp execute the same instruction, i.e., no divergent execution paths. A thread block (software) is executed by the hardware as warps of 32 threads, and the warp schedulers interleave instructions from different warps (Ins W1, Ins W2, Ins W3) to hide latency. A divergent path such as "if (id < 4) { ... } else { ... }" is serialized within the warp.

NOTE: Avoiding divergent paths may not always be possible.
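
To illustrate (a generic example, not HFSS code), the first kernel below diverges within a warp, while the second arranges the branch so that whole warps take the same path:

```cuda
// Generic illustration of warp divergence (not HFSS code).
__global__ void divergentKernel(float* out)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // Lanes 0-3 of every warp take the 'if' path, the rest take 'else':
    // both paths are serialized within the warp.
    if (threadIdx.x % 32 < 4)
        out[id] = 2.0f * out[id];
    else
        out[id] = 0.5f * out[id];
}

__global__ void warpUniformKernel(float* out)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    // Branching on the warp index instead of the lane index means every
    // thread of a given warp takes the same path: no intra-warp divergence.
    int warpId = threadIdx.x / warpSize;
    if (warpId % 2 == 0)
        out[id] = 2.0f * out[id];
    else
        out[id] = 0.5f * out[id];
}
```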

GPU Code Optimization (Cont.)

Key feature 2: In CUDA, instructions are executed per warp (32 threads). Therefore, we should utilize as many of the 32 threads as possible (good granularity).

[Figure: optimized granularity. A warp holds the nodal bases of two elements (E1, E2) instead of only E1, at the cost of more shared memory usage.]

Key feature 3: Carefully choose among the various memory spaces, which have different bandwidths and latencies. Ordered from faster/smaller to slower/larger: registers, shared, constant, global.

- Re-used variables should be stored in registers.
- If there is not enough register space, use shared memory for reused variables.
- Use coalesced access patterns for global memory.

In our implementation:

Memory    | Variables
----------|-----------------------------------------------
registers | local variables
shared    | flux gather, local variables
constant  | reference elements
global    | time-stepping vectors, element matrices, maps
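
A hedged sketch of how these choices might look in a flux-gather style kernel (all names and sizes are assumptions, not the actual HFSS implementation):

```cuda
#include <cuda_runtime.h>

constexpr int NODES_PER_ELEM = 20;   // assumed number of nodal bases per element

// Constant memory: small, read-only reference-element data broadcast to all threads.
__constant__ double refMassInv[NODES_PER_ELEM * NODES_PER_ELEM];

__global__ void fluxGather(const double* __restrict__ fieldIn,   // global memory
                           double* __restrict__ fieldOut,
                           int numElems)
{
    // Shared memory: per-block staging area for the gathered values.
    __shared__ double flux[NODES_PER_ELEM];

    int elem = blockIdx.x;
    int node = threadIdx.x;
    if (elem >= numElems || node >= NODES_PER_ELEM) return;

    // Coalesced global load: consecutive threads read consecutive addresses.
    double u = fieldIn[elem * NODES_PER_ELEM + node];   // held in a register, reused below
    flux[node] = u;
    __syncthreads();

    // Apply the reference-element matrix held in constant memory.
    double acc = 0.0;                                   // register accumulator
    for (int j = 0; j < NODES_PER_ELEM; ++j)
        acc += refMassInv[node * NODES_PER_ELEM + j] * flux[j];

    fieldOut[elem * NODES_PER_ELEM + node] = acc;       // coalesced store
}
```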

GPU Code Optimization (Cont.)

Key feature 4: Avoid GPU hardware “limiting factors” with regard to registers, shared memory, warps, and thread blocks.

Physical limits for GPU compute capability 3.5:

Threads per Warp                               | 32
Warps per Multiprocessor                       | 64
Threads per Multiprocessor                     | 2048
Thread Blocks per Multiprocessor               | 16
Total # of 32-bit registers per Multiprocessor | 65536
Register allocation unit size                  | 256
Register allocation granularity                | warp
Registers per Thread                           | 255
Shared Memory per Multiprocessor (bytes)       | 49152
Shared Memory allocation unit size             | 256
Warp allocation granularity                    | 4
Maximum Thread Block Size                      | 1024

Example 1, limited by registers: 32 threads per block with 255 registers per thread gives blocks = 65536 / (32 * 255) ≈ 8 (max 16).
Example 2, limited by shared memory: 500 double-precision values per thread block gives blocks = 49152 / (500 * 8) = 12.288, i.e., 12 (max 16).
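
These per-multiprocessor limits can also be queried at run time instead of being computed by hand; for example (a sketch using the standard CUDA occupancy API with a placeholder kernel, mirroring the numbers in the two examples above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* data) { data[threadIdx.x] += 1.0f; }

int main()
{
    int blockSize = 32;                              // threads per block (as in Example 1)
    size_t dynamicSmemBytes = 500 * sizeof(double);  // per-block shared memory (as in Example 2)

    int maxBlocksPerSM = 0;
    // Reports how many blocks of this kernel fit on one multiprocessor given
    // its register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, dummyKernel,
                                                  blockSize, dynamicSmemBytes);
    printf("Resident blocks per SM: %d\n", maxBlocksPerSM);
    return 0;
}
```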

Nsight Performance Analysis Tools

- Use the NVIDIA Nsight tool as a guideline in applying all previously mentioned optimizations.
- Through Nsight analysis, our CUDA code was optimized with a 2-3x improvement over the non-optimized case.

GPU Speedup: One Tesla K20c vs. Xeon X5675, 8 Cores

[Figure: bar chart of GPU speedup over 8 CPU cores for a suite of benchmark models (dipole arrays, differential vias, F-35 antennas, flip-chip package, and other designs); the measured speedups range from roughly 0.3x to 3.5x.]

Note: HFSS Transient detects cases not suitable for GPU acceleration and falls back to CPUs.

UHF Blade Antenna on F-35

262,970 tets, fmax = 800 MHz. Tesla K40c vs. Xeon E5-2687W, 8 CPU cores; 2.3 GB GPU RAM; GPU speedup 3.2x.

Transient Analysis of a Smart Phone

Transient field analysis on the CPU, memory, GPS, USB, and Bluetooth ports due to a power surge during battery charging.

[Figure: phone model showing the multiband antenna, GPS antenna, Bluetooth port, USB, CPU, SIM card, and touch screen panel]

1,093,376 tets, fmax = 5 GHz. Tesla K40c vs. Xeon E5-2687W, 8 CPU cores; 6.0 GB GPU RAM; GPU speedup 4.8x.

*Phone model courtesy of Sara Louie

Mutual Coupling between Patch Antennas

833,218 tets, fmax = 1.2 GHz. Tesla K40c vs. Xeon E5-2687W, 8 CPU cores; 6.7 GB GPU RAM; GPU speedup 6.8x.

[Figure: S11 and coupling (S12, S13) results]

*Helicopter model courtesy of Matt Commens

Cosite Interference of LTE Monopoles on A320

2,120,263 tets, fmax = 1.7 GHz. Tesla K40c vs. Xeon E5-2687W, 8 CPU cores; 10.9 GB GPU RAM; GPU speedup 6.3x.

*A320 model courtesy of Matt Commens

Acceleration on Multiple GPUs

- Automatic job assignment for parametric sweeps and network analysis with multiple excitations
- Speedup scales linearly with respect to the number of GPUs
- Automatic detection of GPUs attached to displays, which are excluded from GPU acceleration (a sketch of one way to do this follows below)

[Figure: CPU monitoring by Windows Task Manager; GPU monitoring by nvidia-smi]
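
One plausible way to implement that exclusion (a hedged sketch, not the actual HFSS logic) is to query each device's properties and skip GPUs whose kernel-execution timeout is enabled, since that watchdog is normally active only on display-attached GPUs:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: pick a compute GPU for this process, skipping display-attached devices.
// Using kernelExecTimeoutEnabled as a display heuristic is an assumption.
int pickComputeDevice(int processRank)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    int candidates[64];
    int numCandidates = 0;
    for (int d = 0; d < deviceCount && numCandidates < 64; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        if (prop.kernelExecTimeoutEnabled)   // watchdog => likely driving a display
            continue;
        candidates[numCandidates++] = d;
    }
    if (numCandidates == 0) return -1;       // no suitable GPU: fall back to the CPU solve

    // Distribute processes (e.g. parametric-sweep variations) round-robin over GPUs.
    int chosen = candidates[processRank % numCandidates];
    cudaSetDevice(chosen);
    printf("Process %d using GPU %d\n", processRank, chosen);
    return chosen;
}
```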

Performance Gains on Multiple GPUs

Transient network analysis with 64 excitations. The GPU speedup is 2.0x for 1 GPU vs. 8 CPU cores; each workstation can host up to 4 GPUs and 16 CPU cores; a simulation for one excitation using 8 CPU cores takes 1 hour.

HPC licensing: 1 HPC pack: 1 GPU, 8 CPU cores; 2 HPC packs: 4 GPUs, 32 CPU cores; 3 HPC packs: 16 GPUs, 128 CPU cores; 4 HPC packs: 64 GPUs, 512 CPU cores.

# of Workstations | # of HPC Licenses | # of CPU Cores | # of GPUs | Simulation Time (Hours) | Speedup*
------------------|-------------------|----------------|-----------|-------------------------|---------
1                 | 1                 | 16             | 0         | 32                      | 2
1                 | 1                 | 16             | 1         | 32                      | 2
1                 | 2                 | 16             | 4         | 8                       | 8
4                 | 3                 | 64             | 16        | 2                       | 32
16                | 4                 | 256            | 64        | 0.5                     | 128

*Actual speedup may vary depending on system configurations.
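
As a check on the table (using only the figures stated above: 64 excitations, 1 hour each on 8 CPU cores, and a 2x per-excitation GPU speedup):

$$T_{\text{GPU}}(N) = \left\lceil \frac{64}{N} \right\rceil \times 0.5\ \text{h}, \qquad \text{Speedup}(N) = \frac{64\ \text{h}}{T_{\text{GPU}}(N)},$$

so for N = 16 GPUs, T = 4 x 0.5 h = 2 h and the speedup is 32x, matching the table.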

Conclusions

- DGTD is a good candidate for GPU acceleration due to its inherent parallelism.
- The desired goal of 2x over 8 CPU cores (explicit solver on the GPU vs. hybrid solver on the CPU) is successfully achieved.
- Line-by-line optimization is necessary in order to achieve the performance goals.
- Memory access patterns are critical for GPU acceleration.
- Rethinking algorithms to expose more parallelism can significantly improve performance.
