Arm In HPC - Stony Brook University


Arm in HPC
Contact: john.linford@arm.com

CPU Engagement Models with Arm
Arm IP is the basic building block for extraordinary solutions.
Core License
  Partner licenses complete microarchitecture.
  CPU differentiation via:
    Configuration options.
    Wide implementation envelope with different process technologies.
Architecture License
  Partner designs complete microarchitecture.
    Clean room, scratch.
  Maximum design freedom:
    Directly address needs of the target market.
    Arm architecture validation preserves software compatibility.
Confidential 2019 Arm Limited

Arm Neoverse Momentum in Servers & HPC
2018
  Arm Neoverse announced
2019
  VMware demonstrated ESXi on 64-bit ARM
  Nvidia CUDA stack to Arm platforms
  Neoverse N1 & E1 platform announced
  Marvell's ThunderX2 solution for Microsoft Azure development
  AWS announced 2nd generation Arm-based Graviton2 server CPU
  Marvell announced 96-core ThunderX3 server SoC
  Huawei released Kunpeng 920 CPU and TaiShan server platform
2020
  Ampere announced industry’s 1st 80-core server SoC (128 Altra Max)
  EPI Zeus license
  Fujitsu Fugaku Riken #1 on Top500

Fujitsu/RIKEN Fugaku: Fastest Supercomputer in the World
Top place in 4 categories:
  Top500 @ 416 Pflop/s
  HPCG @ 13.4 Pflop/s
  HPL-AI @ 1.42 Eflop/s
  Graph 500 @ 70980 GTEPS
Figure labels: 2.8x, 4.6x, 2.6x, 3x

Vanguard Astra by HPE
2,592 HPE Apollo 70 compute nodes
  5,184 CPUs, 145,152 cores, 2.3 PFLOPs (peak)
Marvell ThunderX2 ARM SoC, 28 core, 2.0 GHz
Memory per node: 128 GB (16 x 8 GB DR DIMMs)
  Aggregate capacity: 332 TB, 885 TB/s (peak)
Mellanox IB EDR, ConnectX-5
  112 36-port edges, 3 648-port spine switches
Red Hat RHEL for Arm
HPE Apollo 4520 All-flash Lustre storage
  Storage Capacity: 403 TB (usable)
  Storage Bandwidth: 244 GB/s
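The core count and peak figure on this slide can be cross-checked with a little arithmetic. The sketch below is only an illustration: the function name is made up here, and the value of 8 double-precision FLOPs per cycle per core (two 128-bit FMA pipes) is an assumption about ThunderX2 that the slide does not state. Two sockets per node follows from 5,184 CPUs in 2,592 nodes.

```python
def peak_flops(nodes, sockets_per_node, cores_per_socket, ghz, flops_per_cycle):
    """Return (total core count, theoretical peak in FLOP/s)."""
    cores = nodes * sockets_per_node * cores_per_socket
    return cores, cores * ghz * 1e9 * flops_per_cycle


# Node count, core count per SoC and clock come from the slide.
# flops_per_cycle=8 is an assumption, not a slide value.
cores, peak = peak_flops(nodes=2592, sockets_per_node=2, cores_per_socket=28,
                         ghz=2.0, flops_per_cycle=8)
print(cores)                   # 145152
print(round(peak / 1e15, 2))   # 2.32
```

Under that assumption the result agrees with the 145,152 cores and roughly 2.3 PFLOPs quoted above.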

Isambard system specification
http://gw4.ac.uk/isambard/
10,752 Armv8 cores (168n x 2s x 32c)
  Cavium ThunderX2 32 core, 2.1 to 2.5 GHz
Cray XC50 ‘Scout’ form factor
High-speed Aries interconnect
Cray HPC optimised software stack
  CCE, Cray MPI, math libraries, CrayPAT, ...
Phase 2 (the Arm part):
  Delivered Oct 22nd, handed over Oct 29th
  Accepted Nov 9th
  Upgrade to final B2 TX2 silicon, firmware, CPE completed March 15th 2019

Isambard 2 production system
21,504 Armv8 cores (336n x 2s x 32c)
  Marvell ThunderX2 32 core @ 2.5 GHz
Cray XC50 ‘Scout’ form factor
High-speed Aries interconnect
Cray HPC optimised software stack
  Compilers, math libraries, CrayPAT, ...
Also comes with all the open source software toolchains: GNU, Clang/LLVM etc.

Isambard 2’s A64FX Apollo80 system
72 nodes, 3,456 cores, 1.8 GHz
72 TB/s memory bandwidth
202 TFLOP/s double precision
Connected with 100 Gbps InfiniBand
Comes with a Cray software stack
  CCE, Armclang, GNU
  Hope to add the Fujitsu compiler

CEA: Deployment by ATOS
292 Atos Sequana X1310 compute nodes
  584 CPUs, 18,688 cores
Marvell ThunderX2 ARM SoC, 32 cores, 2.2 GHz
Memory: 8 channels, DDR4 2666, 256 GB
Mellanox InfiniBand EDR
Peak Performance 329 TFLOPS
  HPL: 84% efficiency
  HPCG: 3.47 of HPL
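The efficiency figures above relate three numbers: theoretical peak, sustained HPL, and HPCG. A minimal sketch of that relationship follows. It assumes "3.47 of HPL" means 3.47 percent, which the slide does not say explicitly, and the helper names are illustrative.

```python
def hpl_sustained(rpeak_tflops, hpl_efficiency):
    """Sustained HPL performance implied by a peak figure and an efficiency (0..1)."""
    return rpeak_tflops * hpl_efficiency


def hpcg_from_hpl(hpl_tflops, hpcg_percent_of_hpl):
    """HPCG performance when expressed as a percentage of the HPL result."""
    return hpl_tflops * hpcg_percent_of_hpl / 100.0


rmax = hpl_sustained(329.0, 0.84)   # peak and efficiency from the slide
hpcg = hpcg_from_hpl(rmax, 3.47)    # assumes the slide's 3.47 is a percentage
print(round(rmax, 1))   # 276.4
print(round(hpcg, 2))   # 9.59
```

Read this way, the system would sustain about 276 TFLOPS on HPL and under 10 TFLOPS on HPCG. These derived values are not stated in the source.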

AWS Graviton2 - an Arm Server Processor
Graviton Processor
  First Arm-based processor available in major cloud
  Built on 64-bit Arm Neoverse cores with AWS-designed silicon using 16nm manufacturing technology
  Up to 16 vCPUs, 10Gbps enhanced networking, 3.5Gbps EBS bandwidth
Graviton2 Processor
  7x performance, 4x compute cores, and 5x faster memory
  Built with 64-bit Arm Neoverse cores with AWS-designed silicon using 7nm manufacturing technology
  Up to 64 vCPUs, 25Gbps enhanced networking, 18Gbps EBS bandwidth

AWS Graviton 2 for HPC workloads
The c6g instances have outstanding price/performance as compared to similar x86 instances
  The AWS Graviton 2 implements the Arm Neoverse N1
  Up to 40% improved price/performance over x86 instances
Figure axes: Cost (lower is better); Run time (lower is better)
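One way to read a price/performance comparison like this is as the cost of running the same job on two instance types: hourly price multiplied by run time. The sketch below uses entirely hypothetical prices and run times (they are not AWS prices or measured results), and "40% improved" is interpreted here as a 40% lower job cost, which is only one possible reading of the slide.

```python
def job_cost(price_per_hour, runtime_hours):
    """Cost of one run: hourly instance price times wall-clock hours."""
    return price_per_hour * runtime_hours


def cost_saving(cost_baseline, cost_candidate):
    """Fractional saving for the same job; 0.40 means 40% cheaper."""
    return 1.0 - cost_candidate / cost_baseline


# Hypothetical numbers chosen only to illustrate the calculation.
x86_cost = job_cost(price_per_hour=3.00, runtime_hours=2.0)
arm_cost = job_cost(price_per_hour=2.00, runtime_hours=1.8)
print(round(cost_saving(x86_cost, arm_cost), 2))   # 0.4
```

Because cost combines price and run time, an instance can win on cost even when its run time is not the shortest.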

Software Ecosystem

Applications: Open-source, owned, commercial ISV codes, ...
OEM/ODM’s: Cray-HPE, ATOS-Bull, Fujitsu, Gigabyte, ...
Silicon Suppliers: Marvell, Fujitsu, Mellanox, NVIDIA, ...
Compilers: Arm, GNU, LLVM, Clang, Flang, Cray, PGI/NVIDIA, Fujitsu, ...
Libraries: ArmPL, FFTW, OpenBLAS, NumPy, SciPy, Trilinos, PETSc, Hypre, SuperLU, ScaLAPACK, ...
Filesystems: BeeGFS, Lustre, ZFS, HDF5, NetCDF, ...
OS: RHEL, SUSE, CentOS, Ubuntu, ...
Arm Server Ready Platform: Standard firmware and RAS
Middleware: Mellanox IB/OFED/HPC-X, OpenMPI, MPICH, MVAPICH2, OpenSHMEM, OpenUCX, HPE MPI
Cluster Management: Bright, HPE CMU, xCat, Warewulf, ...
Schedulers: SLURM, IBM LSF, Altair PBS Pro, ...
Containers, Interpreters, etc.: Singularity, PodMan, Docker, Python, ...
Debuggers & Profilers: Arm Forge (DDT, MAP), Rogue Wave, HPC Toolkit, Scalasca, Vampir, TAU, ...

A Rich and Growing Application ...
CFD, Visualization, Genomics, AI/ML

GNU and LLVM Toolchains
Toolchains for all Arm cores, supported at release
Status:
  LTS Linux distributions support Arm CPU features when a CPU becomes generally available
  Improve performance for key user workloads and industry benchmarks
GNU Toolchain (compilers, debuggers, libraries, etc.)
  Default compiler in Linux distributions like RedHat, SUSE, Ubuntu
  Key segments: Cloud, networking and HPC
LLVM Toolchain (compilers, debuggers, libraries, etc.)
  Default compiler in Android and the basis for commercial compilers (including Arm and Cray compilers)
  Key segments: Mobile (Android/iOS), Cloud

Example: SVE Support
Over four years of active, ongoing development
Arm actively posting SVE open source patches upstream
  Beginning with first public announcement of SVE at HotChips 2016
Available upstream
  Since GNU Binutils-2.28: Released Feb 2017, includes SVE assembler & disassembler
  Since GCC 8: Full assembly, disassembly and basic auto-vectorization
  Since LLVM 7: Full assembly, disassembly
  Since QEMU 3: User space SVE emulation
  Since GDB 8.2: HPC use cases fully included
Constant upstream review
  LLVM: Since Nov 2016, as presented at LLVM conference
  Linux kernel: Since Mar 2017, LWN article on SVE support
Automatic Arm support in latest version of all tools, peer to x86

Example: Auto-vectorization in LLVM
Auto-vectorization via LLVM vectorizers:
  Use cost models to drive decisions about what code blocks can and/or should be vectorized.
  Since October 2018, two different vectorizers used from LLVM: Loop Vectorizer and SLP Vectorizer.
Loop Vectorizer support for SVE and NEON:
  Loops with unknown trip count
  Runtime checks of pointers
  Reductions
  Inductions
  “If” conversion
  Pointer induction variables
  Reverse iterators
  Scatter / gather
  Vectorization of mixed types
  Global structures alias analysis

Server & HPC Development Solutions from Arm
Commercially supported tools for Linux and high performance computing
Code Generation, for Arm servers (Arm Compiler for Linux)
  C/C++ Compiler
  Fortran Compiler
  Performance Libraries
Performance Engineering, cross platform, scalable
  Debugger
  Profiler
  Reporting
Server & HPC Solution, for Arm servers
  Commercially Supported Toolkit for applications development on Linux
  C/C++ Compiler for Linux
  Fortran Compiler for Linux
  Performance Libraries
  Performance Reports
  Debugger (DDT)
  Profiler (MAP)

Arm Compiler for Linux
a.k.a. Arm Compiler for HPC, a.k.a. Arm Allinea Compiler
Tuned for Scientific Computing, HPC and Enterprise workloads
  Processor-specific optimizations for various server-class Arm-based platforms
  Optimal shared-memory parallelism using latest Arm-optimized OpenMP runtime
Linux user-space compiler with latest features and performance optimizations
  C++14 and Fortran 2003 language support with OpenMP 4.5
  Support for Armv8-A and SVE architecture extension
  Based on LLVM and Flang, leading open-source compiler projects
Commercially supported by Arm
  Available for a wide range of Arm-based platforms running leading Linux distributions: RedHat, SUSE and Ubuntu

Building on LLVM, Clang and Flang projects
Arm C/C++/Fortran Compiler
Language specific frontend
  C/C++ files (.c/.cpp) -> C/C++ frontend (Clang based)
  Fortran files (.f/.f90) -> Fortran frontend (PGI Flang based)
Language agnostic optimization (LLVM based)
  LLVM IR -> Optimizer -> LLVM IR
  IR optimizations, auto-vectorization
  Enhanced optimization for Armv8-A and SVE
Architecture specific backend (LLVM based)
  Armv8-A code-gen -> Armv8-A binary
  SVE code-gen -> SVE binary

Arm Performance Libraries
Optimized BLAS, LAPACK and FFT
Commercial 64-bit Armv8-A math libraries
  Commonly used low-level math routines: BLAS, LAPACK and FFT
  Provides FFTW compatible interface for FFT routines
  Sparse linear algebra and batched BLAS support
  libamath provides high-performance implementations of math.h functions
Best-in-class serial and parallel performance
  Generic Armv8-A optimizations by Arm
  Tuning for specific platforms like Marvell ThunderX2 in collaboration with silicon vendors
Validated and supported by Arm
  Available for a wide range of server-class Arm-based platforms
  Validated with NAG’s test suite, a de-facto standard

Arm Performance Libraries – Leading BLAS performance
Arm Compiler for Linux 20.0 vs latest OpenBLAS vs latest BLIS
High serial performance for BLAS level 3 routines, such as GEMMs
  Also have class leading parallel performance
Shown is DGEMM on square matrices using 56 threads on a ThunderX2
Figure: percentage of peak (0 to 100) versus matrix dimensions M = N = K (0 to 8000) for OpenBLAS, BLIS and Arm PL 20.0
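"Percentage of peak" for DGEMM is usually computed from the operation count 2·M·N·K, the measured run time, and the machine's theoretical peak. The sketch below shows that calculation. The timing is hypothetical, the 56 cores mirror the 56 threads on the slide, and the 2.0 GHz clock and 8 FLOPs per cycle per core are assumptions rather than values from the source.

```python
def dgemm_flops(m, n, k):
    """Floating-point operations in C += A*B: one multiply and one add per term."""
    return 2 * m * n * k


def percent_of_peak(m, n, k, seconds, cores, ghz, flops_per_cycle):
    """Achieved DGEMM rate as a percentage of theoretical peak."""
    achieved = dgemm_flops(m, n, k) / seconds
    peak = cores * ghz * 1e9 * flops_per_cycle
    return 100.0 * achieved / peak


# seconds=1.4 is a made-up timing; ghz and flops_per_cycle are assumptions.
pct = percent_of_peak(8000, 8000, 8000, seconds=1.4,
                      cores=56, ghz=2.0, flops_per_cycle=8)
print(round(pct, 1))   # 81.6
```

Small matrices tend to score lower on this metric because there is too little work per thread to hide overheads, which is why such plots sweep the matrix dimension.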

Arm Performance Libraries: OpenMP Scaling on N1
Run on AWS Graviton2
Shown is DGEMM on square matrices using 64 threads on an AWS Graviton2
Shown for matrix sizes of 100, 1,000 and 10,000
Shows up to 85.7% efficiency for large matrices
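If the efficiency quoted here is parallel efficiency (speed-up divided by thread count, which is an assumption since the slide does not define it), the implied speed-up follows directly. The sketch below shows both directions of the calculation. The function names and the example timings are illustrative.

```python
def parallel_efficiency(t_serial, t_parallel, threads):
    """Speed-up over the serial run divided by the number of threads."""
    return (t_serial / t_parallel) / threads


def implied_speedup(efficiency, threads):
    """Speed-up that corresponds to a given parallel efficiency."""
    return efficiency * threads


# 85.7% and 64 threads are from the slide; the timings below are made up.
print(round(implied_speedup(0.857, 64), 1))       # 54.8
print(parallel_efficiency(64.0, 1.25, 64))        # 0.8
```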

ArmPL 20.0 FFT vs FFTW 3.3.8
1-d complex-to-complex double precision, ThunderX2
Figure: Arm PL speed-up over FFTW (0 to 6) versus transform length (1 to 1000)
  Speed-up > 1: Arm Perf Libs better than FFTW
  Speed-up = 1: performance parity
  Speed-up < 1: FFTW better than Arm Perf Libs

Arm Performance Libraries – Optimized Math Routines
ArmPL includes libamath and libastring
  Enabled by default with Arm Compiler for Linux
  Algorithmically better performance than standard library calls
  No loss of accuracy
  Double precision implementations of: erf(), erfc()
  Single and double precision implementations of: exp(), pow(), log(), log10()
  Single precision implementations of: sin(), cos(), sincos()
  Efficient memory/string functions from string.h
  Enable autovectorization of math and string routines by adding -armpl or -fsimdmath
  ...more to come.
Figure (labelled "Open Source: Normalised runtime", 0 to 1.2): WRF, Cloverleaf, OpenMX and Branson, with series GCC, Arm and Arm libamath

Build Tools
All popular build tools are supported on Arm
Support
  All major build systems and tools: CMake, Make, GNUMake, Spack etc.
  Spack used internally at Arm.
  Arm supports KitWare etc. to ensure build tools like CMake are stable and supported.
  Arm upstreams any necessary changes to support Arm’s commercial tools.
    e.g. CMake toolchain files for Arm Compilers.
Compilation Performance
  A data point: ThunderX2 compilation of large code bases is on par with Intel Skylake
    Usually faster due to higher core counts.
  GNU compilers run faster than LLVM, but that’s not aarch64-specific; same on any arch.

Application Build Recipes and Spack
Spack is used extensively by Arm
Multiple places for recipes
Want to move our knowledge base into Spack
  https://github.com/spack/spack
Would like customers to also contribute to Spack
Ideally get package owners to update their code

MPI Implementations
Out-of-the-box support for Arm in the latest versions of OpenMPI, MPICH and MVAPICH
OpenMPI
  Out-of-the-box support since 3.1.2 (currently 4.0.4)
  developer.arm.com guide
  Upstream contributions
  Used inhouse
  Basis of Bull, Mellanox and Fujitsu K implementations
MPICH
  Active development from Arm and Arm partners
  Basis of Cray and Intel implementations
MVAPICH
  developer.arm.com guide
  Upstream contributions
  Used inhouse
  Basis of Sunway TaihuLight implementation
  Arm investment in OSU: Arm hardware & tools

Parallel Runtime Environments
Threading, thread placement, and affinity
  POSIX threads 2.0 fully supported.
  Thread placement, pinning, affinity via hwloc, numactl, etc.
  Most SoCs support a simple memory hierarchy partitioned into a minimal number of NUMA nodes, e.g. one NUMA node per CPU socket.
  The goal is to minimize code refactoring for performance and eliminate “guess and check” data movement optimization strategies.
Dynamically linked libraries and page size
  Users do not need to change anything in their execution environment or workflow to achieve good performance.
  Demonstrated at multiple application scales at several sites including Sandia and Bristol.
  Tools like LLNL’s Spindle are supported to reduce I/O pressure when loading dynamically linked applications.
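Pinning works the same way on Arm as on other Linux systems. As a rough illustration, the sketch below pins the current process to one CPU using Python's `os.sched_getaffinity` and `os.sched_setaffinity`. This is a stand-in for the hwloc and numactl tools named above, not how those tools are invoked, and the helper name is made up. The affinity calls exist only on Linux, so the helper returns None elsewhere.

```python
import os


def pin_to_first_allowed_cpu():
    """Pin the calling process to the lowest-numbered CPU it may run on.

    Returns the chosen CPU id, or None if the platform lacks the
    Linux-only affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)   # pid 0 means "this process"
    cpu = min(allowed)
    os.sched_setaffinity(0, {cpu})      # restrict scheduling to that CPU
    return cpu


cpu = pin_to_first_allowed_cpu()
print("pinned to CPU:", cpu)
```

In practice an MPI launcher or the OpenMP runtime usually applies placement for every rank and thread, so application code rarely needs to do this itself.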

Scientific Computing
/categories/library
Package Support
  Trilinos, PETSc, Hypre, SuperLU, ScaLAPACK, NetCDF, HDF5, BLIS, etc.
  Tested to work well with Arm and GNU compilers.
  54 packages in Arm’s Community Packages Wiki
Testing and Development
  ThunderX2 access freely available for open source project CI/CD
    packet.net
    Verne Global
Resourcing
  Arm supports communities as part of broader NRE and commercial projects
  Arm provides reactive support to users at key HPC sites worldwide

Arm Performance Engineering Tools Ecosystem
See http://www.vi-hps.org/tools/ for an excellent view of the tools ecosystem.

Hardware Performance Counter Support
Hardware performance counter APIs are fully supported
PAPI
  Support for many aarch64 server-class CPUs: e.g. ThunderX2
  Marvell planning support for future CPUs e.g. ThunderX4
perf events
  Native HPM API is fully supported
  User applications may:
    Initialize the HPM
    Initiate and reset counters
    Read counters
    Generate interrupts on counter overflow
    Register interrupt handlers from each process and thread independently
Documentation and Tools
  Arm MAP, HPCToolkit, IPM, TAU, ScoreP, etc.
  HPM values can be accessed by non-privileged users in a secure manner
  Performance metrics derived from multiple counters:
  Partners provide their own PMU/HPM documentation
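Metrics derived from multiple counters are typically simple ratios of raw counts. The sketch below computes two common ones, instructions per cycle and a cache miss ratio, from a dictionary of counter readings. The counter names and sample values are hypothetical and do not correspond to any particular PMU event list. Real readings would come from PAPI, perf or a profiling tool.

```python
def derived_metrics(counters):
    """Combine raw counter readings into ratio metrics."""
    return {
        # Instructions retired per CPU cycle.
        "ipc": counters["instructions"] / counters["cycles"],
        # Fraction of L1 data-cache accesses that missed.
        "l1d_miss_ratio": counters["l1d_misses"] / counters["l1d_accesses"],
    }


# Made-up readings for illustration only.
sample = {
    "cycles": 2_000_000,
    "instructions": 3_000_000,
    "l1d_accesses": 1_000_000,
    "l1d_misses": 25_000,
}
metrics = derived_metrics(sample)
print(metrics["ipc"])              # 1.5
print(metrics["l1d_miss_ratio"])   # 0.025
```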

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

Arm Forge Ultimate
A cross-platform toolkit for debugging, profiling and performance analysis
The de-facto standard for HPC development
  Commercially supported by Arm
  Available on the vast majority of the Top500 machines in the world
  Fully supported by Arm on Arm servers, x86, IBM Power, Nvidia GPUs, etc.
State-of-the-art debugging and profiling capabilities
  Fully scalable
  Powerful and in-depth error detection mechanisms (including memory debugging)
  Sampling-based profiler to identify and understand bottlenecks
  Available at any scale (from serial to petaflopic applications)
Easy to use by everyone
  Very user-friendly
  Unique capabilities to simplify remote interactive sessions
  Innovative approach to present quintessential information to users

Arm Forge – DDT Parallel Debugger
Switch between MPI ranks and OpenMP threads
Export data and connect to continuous integration
Analyze memory usage
Visualize data structures
Display pending communications

Arm Forge – MAP Multi-node Low-overhead Profiler
Understand MPI/CPU/IO operations thanks to timelines and metrics
Inspect OpenMP activity
Analyze GPU efficiency
Investigate annotated source code and stack view
Profile Python-based workloads

Arm Performance Reports Application Analysis Tool
Analyze all performance aspects in a single HTML or TXT file
Qualify the type of workload
Inspect key metrics on SIMD, multithreading, IO, MPI efficiency and many more
Follow the guidance for your next steps and maximize output
