Benchmarking Huawei ARM Multi-Core Processors For HPC Workloads - Linaro

Transcription

Benchmarking Huawei ARM Multi-Core Processors for HPC Workloads
Key Liao
Center for HPC, Shanghai Jiao Tong University
Jan 9th, 2019

About Me
Key Liao (廖秋承)
- B.S. from Environment Science and Engineering, SJTU.
- HPC Engineer of Center for High Performance Computing, SJTU.
- Leader of ARM Research Team at CHPC, SJTU.
- Supervisor of SJTU Student HPC Competition Team.
Main Research Area:
- Computer Architecture
- Theoretical Computer
- Performance Evaluation
- Performance Optimization
Email: keymorrislane@sjtu.edu.cn

Outline
- Kunpeng 920
  - Floating-point Arithmetic
  - Memory subsystem
- Proxy Applications
  - TeaLeaf
  - SNAP
  - CloverLeaf
- Real-world applications
  - GTC-P

Chips Information
[Figure: chip topology diagram with core groups labelled grp 0 to grp 5; the cores labelled inside a group are core 0, core 1, core 3, core 4.]

Chips Information

Model | Intel Xeon Gold 6148 | Hi1616 | Kunpeng 920 (Engineering …)
Main Frequency (GHz) | 2.4 | 2.4 | 2.0
Num of Cores | 20 | 32 | 48
Vectorization | … | … | …
Theoretical DP Peak Performance (GFLOPS)* | 1536 | 307.2 | 768
L3 Cache | 1.375 MB | 32 MB (shared) | 64 MB (shared)
DRAM Support | 6 x DDR4-2666 | 4 x DDR4-2400 | 8 x DDR4-3200
TDP | 150 | 70 | 150
Launch Time | 2017 | 2016 | 2019

* Theoretical DP peak performance is calculated from the frequency observed during testing while each chip runs its best vectorization instruction set.
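The DRAM Support row also implies a nominal peak memory bandwidth per socket. The following is a minimal sketch, assuming each entry means "number of channels x DDR4 speed grade" and that a DDR4 channel is 64 bits (8 bytes) wide. The function name is hypothetical, and the results are theoretical upper bounds, not measurements from the talk.

```python
def ddr4_peak_bandwidth_gbs(channels, mt_per_s, bytes_per_transfer=8):
    """Nominal peak bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mt_per_s * bytes_per_transfer / 1000.0


# (channels, MT/s) as read from the "DRAM Support" row of the chip table.
# Interpreting the first number as a channel count is an assumption.
dram_support = {
    "Xeon Gold 6148": (6, 2666),
    "Hi1616": (4, 2400),
    "Kunpeng 920": (8, 3200),
}

peak_bw = {
    name: ddr4_peak_bandwidth_gbs(channels, mts)
    for name, (channels, mts) in dram_support.items()
}

for name, bw in peak_bw.items():
    print(f"{name}: {bw:.1f} GB/s")
```

Under the same assumptions, DRAM running at 2666 MHz on an 8-channel Kunpeng 920 (the frequency listed for the test platform) would give a lower bound of about 170.6 GB/s.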

Platform Information

Platform | 6148 | 1616 | 920
CPU | Xeon Gold 6148 | Hi1616 | Kunpeng 920
Number of Sockets | 4 | 4 | 8
DRAM Size (GB) | 2048 | 256 | 256
DRAM Frequency (MHz) | 2666 | 2400 | 2666
Linux | CentOS 7.5, Kernel 3.10.0 | EulerOS, Kernel 4.11.0 | EulerOS, Kernel 4.14.0
Compiler | All with Intel Parallel Studio XE Cluster Version 2019 Update 1 (Education License) | GNU/GCC-8.2.0 | GNU/GCC-8.2.0
MPI Library | (as above) | MVAPICH2-2.3 | MVAPICH2-2.3
BLAS Library | (as above) | OpenBLAS 0.3.5 | OpenBLAS 0.3.5

Floating-point Arithmetic
- Kunpeng 920 is 41.1% better than Hi1616, compared to a 165.3% increase from Haswell to Skylake in 3 years.
- HPL efficiency on Kunpeng 920 is around 40%, compared to more than 70% on other chips.
[Figures: "HPL Benchmark on Four Platforms" and "HPL Efficiency on Four Platforms"; platforms 2683, 6148, 1616, 920; series Single Socket and Dual Socket.]

Floating-point Arithmetic
- Hi1616: 128-bit SIMD; SP: 614.4 GFLOPS; DP: 307.2 GFLOPS
- Hi1620: 128-bit SIMD; SP: 1,536 GFLOPS; DP: 384 GFLOPS
- Throughput of DP SIMD instructions is limited.
- Not a good chip for intense DP computation.
- DP computation is not as important as people used to think.
- Trend on SVE and VLA.

FMA Instruction Throughput

Chip | SP Scalar | DP Scalar | SP Vector | DP Vector
Hi1616 | 2 ins/cycle, 9.596 GFLOPS | 2 ins/cycle, 9.596 GFLOPS | 1 ins/cycle, 19.194 GFLOPS | 1 ins/cycle, 9.596 GFLOPS
Kunpeng 920 | 2 ins/cycle, 7.989 GFLOPS | 2 ins/cycle, 7.989 GFLOPS | 2 ins/cycle, 31.954 GFLOPS | 1 ins/cycle, 7.989 GFLOPS
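The chip-level SP and DP figures on this slide can be reproduced from the FMA throughput table. The sketch below assumes that one FMA counts as two floating-point operations per lane, that a 128-bit vector holds 4 SP or 2 DP lanes, and that the core counts and frequencies are those from the chip table (32 cores at 2.4 GHz for Hi1616, 48 cores at 2.0 GHz for Kunpeng 920). The function name is hypothetical.

```python
def peak_gflops(cores, ghz, fma_per_cycle, simd_bits, elem_bits):
    """Theoretical peak: cores x GHz x FMA/cycle x lanes x 2 flops per FMA."""
    lanes = simd_bits // elem_bits
    return cores * ghz * fma_per_cycle * lanes * 2


# Vector FMA issue rates taken from the "FMA Instruction Throughput" table.
hi1616_sp = peak_gflops(32, 2.4, fma_per_cycle=1, simd_bits=128, elem_bits=32)
hi1616_dp = peak_gflops(32, 2.4, fma_per_cycle=1, simd_bits=128, elem_bits=64)
kp920_sp = peak_gflops(48, 2.0, fma_per_cycle=2, simd_bits=128, elem_bits=32)
kp920_dp = peak_gflops(48, 2.0, fma_per_cycle=1, simd_bits=128, elem_bits=64)

# Per-core check: one Kunpeng 920 core issuing SP vector FMAs.
kp920_core_sp = peak_gflops(1, 2.0, fma_per_cycle=2, simd_bits=128, elem_bits=32)

print(f"Hi1616      SP {hi1616_sp:.1f}  DP {hi1616_dp:.1f} GFLOPS")
print(f"Kunpeng 920 SP {kp920_sp:.1f}  DP {kp920_dp:.1f} GFLOPS")
print(f"Kunpeng 920 per-core SP vector: {kp920_core_sp:.1f} GFLOPS")
```

With these inputs the formula gives 614.4 and 307.2 GFLOPS for Hi1616 and 1,536 and 384 GFLOPS for Kunpeng 920, matching the slide. The per-core value of 32 GFLOPS is close to the measured 31.954 GFLOPS. Under the same formula, the 768 GFLOPS DP peak listed in the chip table would correspond to two DP vector FMAs per cycle, whereas the throughput table lists one.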

Memory Subsystem
[Figures: "Normalized Average Latency" and "Normalized Bandwidth of Different Memory Layers"; platforms 6148 and 920; y-axis Relative Scale; categories L1 Read, L1 Write, L2 Read, L2 Write, L3 Read, L3 Write, DRAM Read, DRAM Write.]

Chip Communication

Proxy Applications
- SNAP: A proxy application for a modern deterministic discrete ordinates transport code.
- TeaLeaf: Proxy app for solving the linear heat conduction equation on a spatially decomposed regular grid, utilising a five point finite difference stencil.
- CloverLeaf: Solving Euler's equations of compressible fluid dynamics, under a Lagrangian-Eulerian scheme, on a two-dimensional spatial regular structured grid.
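To make the "five point finite difference stencil" concrete, here is a minimal pure-Python sketch of one explicit heat-diffusion update on a 2D regular grid. It is a simplified illustration only and is not TeaLeaf's actual solver. The function name and the parameter `alpha` (standing for conductivity x dt / dx^2) are assumptions introduced for this example, and boundary cells are simply held fixed.

```python
def five_point_step(u, alpha):
    """Return a new grid after one explicit update of the interior cells.

    Each interior cell is updated from its four direct neighbours and itself
    (the five-point stencil). Boundary cells are copied unchanged.
    """
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            laplacian = (u[j][i - 1] + u[j][i + 1]
                         + u[j - 1][i] + u[j + 1][i]
                         - 4.0 * u[j][i])
            new[j][i] = u[j][i] + alpha * laplacian
    return new


# A 5 x 5 grid with a single hot cell in the centre.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
stepped = five_point_step(grid, alpha=0.25)
for row in stepped:
    print(" ".join(f"{value:.2f}" for value in row))
```

Each cell update reads five values and writes one while doing only a handful of floating-point operations, which is why stencil codes of this kind tend to be limited by memory bandwidth rather than arithmetic throughput.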

Proxy Applications - Results
[Figures: "SNAP Grind Time" (y-axis Grind Time) and "CloverLeaf-bm128 short" (y-axis Wall Clock (s)), among others; platforms 6148 and 920; series Single, 2-Socket Strong Scaling, 2-Socket Weak Scaling.]

Proxy Applications - Results
[Figures: two panels titled "Normalized Performance of Proxy Applications on ..." (the first "on Single ..."), comparing 6148 and 920; visible workloads: CloverLeaf bm16 and CloverLeaf bm128 short.]

Proxy Applications - SNAP
- Generally loads a relatively big data set and performs random access in the data set (dim3_sweep.f90).
- If OpenMP is enabled, threading is applied across the data set.
- MPI_Recv becomes a hotspot after scaling across sockets.
[Figure: "SNAP Grind Time (9600 cells, nang 64, ng 332, nstep 100)"; y-axis Grind Time (ns); platforms 6148 and 920; series Single, 2-Socket Strong Scaling, 2-Socket Weak Scaling. Annotations: "Same single node performance" (0.916 and 0.91) and "26.9% Speedup". Other bar labels: 0.77, 0.766, 0.56, 0.462.]

Proxy Applications - TeaLeaf
- Bound by memory subsystem bandwidth.
- Problem size: 3840 x 3840, 10000 steps.
[Figure: "TeaLeaf Relative Speedup"; platforms 6148 and 920; series Single, 2-Socket Strong Scaling, 2-Socket Weak Scaling; bar labels 1.89x, 1.77x, 1.0x, 1.0x, 0.600x, 0.607x.]

Proxy Applications - CloverLeaf
- Bound by memory subsystem bandwidth.
- But double-precision arithmetic intensity increases as the number of cells increases and the total number of iterations decreases.
[Figures: "CloverLeaf-bm16" and "CloverLeaf-bm128 short"; y-axis Wall Clock (s); platforms 6148 and 920; series Single and 2-Socket Strong Scaling.]

GTC-P
- GTC-P: Gyrokinetic Toroidal Code - Princeton.
- GTC-P is a Particle-in-Cell code that delivers fusion simulations at extreme scales on worldwide supercomputers, including Tianhe-2, Titan and TaihuLight, that feature CPU, GPU and many-core processors.
- Supported by NSF SAVI Project.

GTC-P
Kunpeng 920

GTC-P
GTC-P Performance With Different Combinations of Processes and Threads on Kunpeng 920

GTC-P
Kunpeng 920

Conclusion
- Kunpeng 920 handles scientific computations with relatively low arithmetic intensity (< 4 DP F/B) better than Intel's recent chip at a similar price.
- Pro
  - Good topology design for threading.
  - High bandwidth and low latency; does well in many memory-bound apps.
- Con
  - Low bandwidth of the Hydra Interface.
  - Low DP arithmetic capability.
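The arithmetic-intensity threshold in the conclusion can be related to a simple roofline estimate. The sketch below is illustrative only. It assumes the theoretical DP peaks from the chip table (1536 and 768 GFLOPS), nominal DRAM bandwidths derived from the listed channel counts and DDR4 speed grades at 8 bytes per transfer, and that "Intel's recent chip" refers to the Xeon Gold 6148. All names in the code are hypothetical.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute peak, arithmetic intensity x bandwidth)."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Peak DP GFLOPS from the chip table; bandwidth in GB/s is an assumed nominal
# value: channels x MT/s x 8 bytes / 1000.
chips = {
    "Xeon Gold 6148": {"peak": 1536.0, "bw": 6 * 2666 * 8 / 1000.0},
    "Kunpeng 920": {"peak": 768.0, "bw": 8 * 3200 * 8 / 1000.0},
}

# Machine balance (ridge point): the intensity at which a chip stops being
# bandwidth-bound and becomes compute-bound.
balance = {name: c["peak"] / c["bw"] for name, c in chips.items()}

for name, c in chips.items():
    print(f"{name}: ridge point {balance[name]:.2f} DP F/B")
    for intensity in (0.5, 2.0, 8.0):
        bound = attainable_gflops(intensity, c["peak"], c["bw"])
        print(f"  AI {intensity:>4} F/B -> {bound:7.1f} GFLOPS")
```

Under these assumptions the Kunpeng 920 ridge point is 3.75 DP F/B, close to the 4 DP F/B figure quoted above. Below that intensity its higher nominal bandwidth gives it the higher bound; at high intensity the Xeon's larger DP peak dominates.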
