Evolving Use Of GPU For Dassault Systems Simulation Products - NVIDIA

Transcription

13DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Evolving Use of GPU forDassault Systemes SimulationProductsLuis CrivelliAndMatt Dunbar

3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Dassault Systémes is dedicated to making Realistic Simulationan integralbusiness practiceto nd,Improveproduct, life, & nature2Improve

3DS.COM/SIMULIA Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Simulation with Abaqus3Finite Element SimulationAbaqus/Standard – static structural simulationsAbaqus/Explicit – short term dynamic simulations“Predictive Crashworthiness Simulation in a Virtual Design Process without Hardware Testing”,Jurgen Lescheticky, Hariaokto Hooputra and Doris Ruckdeschel, BMW Group, SIMULIACustomer Conference, May 2010

3DS.COM/SIMULIA Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Simulation runtimes4Simulation is a valuable part of an engineering design process, butcomputational cost is significantBottle StackingAutomotive Crash Routine static analysis Compute time of 4-6hours on 4 x86 cores Complex dynamics model Compute time of 2-5 daysusing between 16 and 32x86 coresStent ExpansionGasket Sealing Complex static analysis Compute time of 12-24hours using between 8and 32 x86 cores Large static model Compute time of 2-5days using between 32and 64 x86 coresReducing compute cost iscritical!

Demand for Higher Accuracy4540353025209080Ability to solve moreaccurate and morecomplex simulations wasenabled between 2005 andpresent by cluster70605040153010205100200820065100Expect that use of GPGPU forcompute will accelerate trendtowards efficient execution of20092007201020080Number of customers runningmodels with more than 5M DoFsActual Customer Model DoFs(Millions)3DS.COM/SIMULIA Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 201250

3DS.COM/SIMULIA Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Simulation compute architecture6Computational work is distributed to many cores on multiple servers by splitting a modelinto domains which are assigned to cores to parallelize computationsNew codeallows x86cores tooffload keycomputework ontoone or moreNvidia Teslacards toacceleratecomputation

Abaqus3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Existing Cluster-based ArchitectureAbaqus applications allow users to exploit X86clusters to decrease runtimesCan GPU deliver a faster or moreefficient PCIRAM7PCIRAMPCIRAMRAM

Approaches to Exploiting GPU3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012RAM8Offload Code to Card X86 cores are idle while GPU cardHybrid Mode X86 cores and GPU card are usedprocessessimultaneously with X86 assigningappropriate work to GPUX86GPUPCI busX86GPURAMGPU As Platform Very limited “control” compute onX86; bulk of work done on GPU cardX86GPU/MIC

Abaqus Solvers3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Key Code ComponentsCode ComponentNature of CodeLinear Equation SolverIncreases with problem size.Dense linear algebra kernelsand “control code.” Relativelysmall amount of code withhigh computational cost.Finite ElementsMost significant cost otherthan equation solver.Naturally parallel but code isnot written to expose SIMDparallelism.ElementsTypically 50%-75% of cost.Naturally parallel. Code iswritten for SIMD architectures.Constraints and otherMuch of remaining cost.Complex parallelism.Abaqus/StandardAbaqus/ExplicitUses GPU in Abaqus 6.11 and 6.129Cost of CodeFocus of prototyping work

Abaqus/Standard13 Million DOF Powertrain Analysis, 6 Iterations,1.1E 14 FLOPS per factorization480ABAQUS/Standard420Direct SolverWall Clock Time (minutes)3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Time in Equation Solver360300240Natural target for GPUacceleration1801206001 host, 8 cores10As noted on previous slideAbaqus/Standard equationsolver is a significant cost inexecution of a simulation8 hosts, 64 cores 16 hosts, 128 cores

Abaqus/Standard3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Single GPU/Single Server11RAMSpeedup5.7X4.1X2.8X1.8XRAMAt lower core counts GPU is more effectivebecause the accelerated code takes a higherpercentage of the overall time (codeaccelerated by the GPU is also effectivelyparallelized on X86 cores)Beyond 4 cores, X86 processors startoverwhelming the GPU with computation

Abaqus/StandardPCI1400RAMSpeedup with GPU1.31000Speedup relative to no GPU caseRAMRAMGPUJobs run on 2 hosts eachwith 12 cores, 48 GB ofRAM, and 2 Nvidia GPU’sProblem 1800600GPU1.41.41.712971.410381.99044001 GPU2 GPU9491.5697692560502200No GPU3752.45373482210Problem 1 Total12RAM1200Time (seconds)3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Cluster and multi-GPU support – hybrid modePCIProblem 2 TotalProblem 1 SolverProblem 2 Solver 1.5 MDOF 5.37 teraflops per iterationProblem 2 3.4 MDOF 10.8 teraflops per iteration

Abaqus/Explicit3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012GPU as a platform using Portland Group (OpenACC) compiler13Explicit finite element codes do not havea natural bottleneckLoop over groups of elementsCompute cost is spread through many100’s of routinesLoop A – process 1 to ngroup elementsLoop B – process 1 to ngroup elementsCode does have a natural SIMD structureSIMULIA has done a prototype in whichthe Portland Group compiler was used tobuild existing X86 code for GPULoop C – process 1 to ngroup elementsProcess group 231Parallelization is done at a fine-grainedlevel. Generally relies on elements beingprocessed being identical (SingleInstruction Multiple Data)

Abaqus/Explicit Prototype3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012X86 Socket14PCI busGPUPCIRAMRAMGPUData transfer – 230 msC3D8R Element CodeXeon X5680 – 1 core – 357 msXeon X5680 – 6 cores – 70 msNvidia Tesla – 21 msData transfer – 230 msData cannot be transferred butmust be left on GPU

3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Memory bandwidth bounds GPU performance15Given peak performance of GPU near 1 TFlop vs peak performance for Westmere of 70 Gflopswhy does Explicit code sped up by only 3X?do k 1, groupsizetemp1(k) in1(k) * in2(k)ComputationalUnitCacheL1L2L3do k 1, groupsizetemp2(k) temp1(k) * in3(k)do k 1, groupsizeout(k) temp2(k) * xHigh degree of data parallelism, but eachpiece of data is used a small number oftimes in operationsOn X86 system temporary arrays remain in cache,but on GPU limited local memory is sharedbetween a large number of threads so temporarydata ends up be written back to system memory.ComputationalUnitLocal Memory768 KB 150 GB/sec 25 GB/secRAMGPU SystemMemory

3DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012Conclusions16Use of GPU as an X86 accelerator in a hybrid mode is effective and continuedarea of developmentInvestigations into GPU as a platform are ongoingExperience with OpenACC compiler approach has been goodHighly data parallel code does have bandwidth issues on GPU which limits gainsData management between X86 memory and GPU is a key topic

173DS.COM Dassault Systèmes Confidential Information 5/16/2012 ref.: 3DS Document 2012

Approaches to Exploiting GPU Offload Code to Card X86 cores are idle while GPU card processes X86 GPU PCI bus X86 GPU Hybrid Mode X86 cores and GPU card are used simultaneously with X86 assigning appropriate work to GPU X86 GPU/MIC GPU As Platform Very limited "control" compute on RAM RAM 9 OM written for SIMD architectures. s 2 2