What We Did for Co-Design in Development of "Fugaku"

Transcription

ORAP Forum, November 29, 2019 @ Paris
What We Did for Co-Design in Development of "Fugaku"
Mitsuhisa Sato
Team Leader of Architecture Development Team
Deputy project leader, FLAGSHIP 2020 project
Deputy Director, RIKEN Center for Computational Science (R-CCS)
Professor (Cooperative Graduate School Program), University of Tsukuba

Missions
- Building the Japanese national flagship supercomputer "Fugaku" (a.k.a. post-K), and
- Developing a wide range of HPC applications, running on Fugaku, in order to solve social and science issues in Japan (the application development projects will be over at the end of March)

FLAGSHIP2020 Project "Fugaku" Status and Update
- March 2019: The official contract with Fujitsu to manufacture, ship, and install hardware for Fugaku is done. RIKEN revealed #nodes = 150K.
- March 2019: The name of the system was decided as "Fugaku".
- Aug. 2019: The K computer was decommissioned; services were stopped and the system shut down (removed from the computer room).
- Oct. 2019: Access to the test chips was started.
- Nov. 2019: Fujitsu announced FX1000 and FX700, and business with Cray.
- Nov. 2019: Fugaku clock frequency will be 2.0 GHz, with boost to 2.2 GHz.
- Nov. 2019: Green500 1st position!
- Oct-Nov 2019: MEXT announced the Fugaku "early access program" to begin around Q2/CY2020.
- Around Jan. 2020: Installation of "Fugaku" will be started.

Overview of Fugaku architecture
- Node: manycore architecture
  - Armv8-A SVE (Scalable Vector Extension), SIMD length: 512 bits
  - # of cores: 48 (+ 2/4 for OS) (2.7 TF / 48 cores)
  - Co-design with application developers and high memory bandwidth utilizing on-package stacked memory (HBM2): 1 TB/s B/W
  - Low power: 15 GF/W (dgemm)
- Network: TofuD
  - Chip-integrated NIC, 6D mesh/torus interconnect

No. 1 in Green500 at SC19! (Announced by Fujitsu at SC19)

KPIs on Fugaku development in the FLAGSHIP 2020 project
Three KPIs (key performance indicators) were defined for Fugaku development:
1. Extreme power-efficient system
- Maximum performance under a power consumption of 30-40 MW (for the system)
- Approx. 15 GF/W (dgemm) confirmed with the prototype CPU: 1st in Green500!
2. Effective performance of target applications
- Expected to exceed 100 times the K computer's performance in some applications
- Estimated 125 times faster in GENESIS (MD application) and 120 times faster in NICAM+LETKF (climate simulation and data assimilation)
3. Ease-of-use system for a wide range of users
- Co-design with application developers
- The shared-memory system with high-bandwidth on-package memory must make existing OpenMP-MPI programs easy to port; no programming effort for accelerators such as GPUs is required.

Target Applications' Performance
Performance targets:
- 100 times faster than K for some applications (tuning included)
- 30 to 40 MW power

Predicted performance of the 9 target applications (as of 2019/05/14):

Health and longevity
- 1. Innovative computing infrastructure for drug discovery: GENESIS (MD for proteins), x125 over K
- 2. Personalized and preventive medicine using big data: Genomon (genome processing, genome alignment), x8
Disaster prevention and environment
- 3. Integrated simulation systems induced by earthquake and tsunami: GAMERA (earthquake simulator; FEM in unstructured & structured grid), x45
- 4. Meteorological and global environmental prediction using big data: NICAM+LETKF (weather prediction system using big data; structured grid stencil & ensemble Kalman filter), x120
Energy issues
- 5. New technologies for energy creation, conversion/storage, and use: NTChem (molecular electronic structure calculation), x40
- 6. Accelerated development of innovative clean energy systems: Adventure (computational mechanics system for large-scale analysis and design; unstructured grid), x35
Industrial competitiveness enhancement
- 7. Creation of new functional devices and high-performance materials: RSDFT (ab-initio program, density functional theory), x30
- 8. Development of innovative design and production processes: FFB (large eddy simulation; unstructured grid), x25
Basic science
- 9. Elucidation of the fundamental laws and evolution of the universe: LQCD (lattice QCD simulation; structured grid Monte Carlo), x25

Co-design in HPC
"Co-design" in Wikipedia:
- "Co-design or codesign is a product, service, or organization development process where design professionals empower, encourage, and guide users to develop solutions for themselves."
- "The phrase co-design is also used in reference to the simultaneous development of interrelated software and hardware systems. The term co-design has become popular in mobile phone development, where the two perspectives of hardware and software design are brought into a co-design process."
The co-design of HPC must optimize and maximize the benefits to cover as many applications as possible:
- This is different from "co-design" in embedded systems; in the embedded field, co-design sometimes includes "specialization" for particular applications.
- In HPC, on the other hand, the system will be shared by many applications.

Why is "co-design" needed in very high-end HPC and exascale?
- In modern very high-end parallel systems, more performance can be delivered (even up to "exascale") by increasing the number of nodes, but
- We need to design the system by trading off "energy/power", "cost", and performance, so as to satisfy the constraints on "energy/power" and "cost" while maximizing performance.
- We need to design the system by taking the characteristics of applications into account: this is "co-design" in HPC.
The "co-design" in our post-K project:
- We design the processor/network and the system together with the selected partner vendor.
- This is different from supercomputer acquisition in universities.

Co-design in HPC
Richard F. Barrett, et al., "On the Role of Co-design in High Performance Computing", Transition of HPC Towards Exascale Computing.

Before starting the FLAGSHIP2020 Project
- The K computer project (2006-2012): the public service of the K computer started in 2012.
- Feasibility study project (2012-2013):
  - Application study team led by RIKEN AICS (Tomita)
  - System study team led by U. Tokyo (Ishikawa): next-generation "general-purpose" supercomputer
  - System study team led by U. Tsukuba (Sato): exascale heterogeneous systems with accelerators
  - System study team led by Tohoku U. (Kobayashi)
- The initial plan at the beginning (2014) was a combined system with a "general-purpose" supercomputer and "accelerators":
  - The basic design follows the "general-purpose" processor from the feasibility study by U. Tokyo; almost all of the co-design was done in this phase.
  - But the development of the "accelerators" was canceled due to budget problems.

Technologies and Architectural Parameters to be Determined
Chip configuration:
- # of cores, # of NUMA nodes
- Cache configuration (shared within a NUMA node), cache size
- Chip connection, memory technologies (DDR, HBM, HMC, ...)
- Frequency, silicon fabrication technology (10nm, 7nm FinFET, ...), chip die size, power consumption
Core specification:
- # of SIMD pipes, SIMD vector length
- Cache in core (size, latency, ...)
- Out-of-order (O3) resources, specialized hardware
Interconnect:
- SerDes, "Tofu" or another network?
We had to examine the technologies available 3 or 4 years after the design, and the cost (project budget)!

Technology trends (examined at design time):
- Si technology: 16nm FinFET, 10nm FinFET, 7nm FinFET
- Memory technology: HMC, HBM1, HBM2, CoWoS (Si interposer)
- SerDes: 4x 28.05G, 56G?
- IO technology: PCIe Gen3, PCIe Gen4

Chip configuration
- # of clusters, # of cores/cluster, NoC
- Chip size is a major factor of cost (cost gets disproportionately higher beyond some size)
- MCM & interposer (organic, silicon, glass, ...)
- Memory technology cost: HMC (memory size, power), HBM (memory size, power), DDR (memory size, power, cost)
Candidate configurations considered:
- Small chip with HMC (cluster + HMC + NIC & PCIe)
- Small chip with HMC, using an MCM
- Large chip with HMC
- Large chip with HBM (selected as A64FX)
Performance was estimated by "analytical models" for simple kernels such as HPL.

Target applications for co-design
- Typical benchmarks such as HPL (dgemm) and STREAM
- Target applications provided by each application project (the "9 priority issues")
- The target applications are representative of almost all our applications in terms of computational methods and communication patterns, in order to design the architectural features.

Target science: 9 Priority Issues
① Innovative Drug Discovery (RIKEN Quant. Biology Center)
② Personalized and Preventive Medicine (Inst. Medical Science, U. Tokyo)
③ Hazard and Disaster induced by Earthquake and Tsunami (Earthquake Res. Inst., U. Tokyo)
④ Environmental Predictions with Observational Big Data (Cent. for Earth Info., JAMSTEC)
⑤ High-Efficiency Energy Creation, Conversion/Storage and Use (Inst. Molecular Science, NINS)
⑥ Innovative Clean Energy Systems (Grad. Sch. Engineering, U. Tokyo)
⑦ New Functional Devices and High-Performance Materials (Inst. for Solid State Phys., U. Tokyo)
⑧ Innovative Design and Production Processes for the Manufacturing Industry in the Near Future
⑨ Fundamental Laws and Evolution of the Universe (Cent. for Comp. Science, U. Tsukuba)

Tools for co-design
- Performance estimation tool: modeled on the Fujitsu micro-architecture
- Fujitsu in-house simulators:
  - Extended FX100 (SPARC) simulator and compiler for preliminary studies
  - Armv8 SVE simulator and compiler
- These enable performance projection from a Fujitsu FX100 execution profile onto a set of architectural parameters.
- (A gem5-based simulator has been developed after the co-design, for architecture verification and performance tuning by RIKEN.)
- Emulator for logic design verification

Performance estimation tool
- Estimates the performance of multithreaded programs on the "new architecture" (post-K) from profile data taken on a Fujitsu FX100.
- Estimates the maximum performance attainable when the busy time of each component is hidden by the others, based on Fujitsu's micro-architecture.
- Input: a Fujitsu FX100 execution profile and execution-time breakdown (memory access busy time, memory write access busy time, L2 busy time, L1 busy time, ...).
- Projection by the throughput of each element (similar to a "roofline model").
- Output: an estimated execution-time breakdown on the new architecture.
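To make the projection idea concrete, here is a minimal sketch (not the actual Fujitsu tool; the structures, field names, and numbers are assumptions for illustration): each component's measured busy time is scaled by an assumed FX100-to-new-architecture throughput ratio, and the largest scaled busy time is taken as the projected execution time, i.e. the optimistic case where all components fully overlap.

/* Minimal sketch (not the Fujitsu tool): project a kernel's execution time
 * onto a new architecture from a measured busy-time breakdown, assuming
 * each component's busy time scales with its throughput ratio and that
 * the slowest component dominates (roofline-style).                      */
#include <stdio.h>

typedef struct {
    double flop_busy;   /* seconds the FP pipelines were busy on FX100   */
    double l1_busy;     /* seconds the L1 cache ports were busy          */
    double l2_busy;     /* seconds the L2 cache was busy                 */
    double mem_busy;    /* seconds the memory interface was busy         */
} profile_t;

/* throughput ratios new/old for each component (assumed values) */
typedef struct { double flop, l1, l2, mem; } speedup_t;

static double project_time(profile_t p, speedup_t s)
{
    double t_flop = p.flop_busy / s.flop;
    double t_l1   = p.l1_busy   / s.l1;
    double t_l2   = p.l2_busy   / s.l2;
    double t_mem  = p.mem_busy  / s.mem;
    /* optimistic bound: busy times overlap, the largest one dominates */
    double t = t_flop;
    if (t_l1  > t) t = t_l1;
    if (t_l2  > t) t = t_l2;
    if (t_mem > t) t = t_mem;
    return t;
}

int main(void)
{
    profile_t prof  = { 0.8, 0.5, 0.4, 1.2 };   /* example busy times measured on FX100 */
    speedup_t ratio = { 6.0, 4.0, 3.0, 4.2 };   /* assumed FX100 -> new-arch throughput ratios */
    printf("projected kernel time: %.3f s\n", project_time(prof, ratio));
    return 0;
}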

Co-design Methodology
- Extract kernels from each application benchmark, and break the benchmark down into a set of kernels.
- Co-design process for each kernel (see the sketch below):
  1. Set a set of system parameters.
  2. Tune the target applications under those system parameters.
  3. Evaluate the execution time using the "estimation tool".
  4. Identify hardware bottlenecks and change the set of system parameters.
- Analytical models are used for simple kernels such as HPL (dgemm) and STREAM.
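A minimal sketch of that iteration, assuming a toy roofline-style cost model and invented parameter names (the real process used Fujitsu's estimation tool and measured application profiles): sweep candidate configurations, project the summed kernel time for each, and keep the fastest one that fits the power budget.

/* Minimal sketch of the co-design iteration described above. The cost
 * model and all parameter names/values are invented for illustration.  */
#include <stdio.h>

typedef struct {
    int    cores;        /* compute cores per chip  */
    int    simd_bits;    /* SIMD width              */
    double freq_ghz;     /* clock frequency         */
    double mem_bw_gbs;   /* memory bandwidth, GB/s  */
    double power_w;      /* estimated chip power, W */
} params_t;

/* Toy roofline-style estimate for one kernel: time is bounded either by
 * compute throughput or by memory traffic, whichever is larger.        */
static double estimate_kernel_time(params_t p, double gflop, double gbyte)
{
    double peak_gflops = p.cores * (p.simd_bits / 64.0) * 2.0 * p.freq_ghz; /* DP FMA */
    double t_compute   = gflop / peak_gflops;
    double t_memory    = gbyte / p.mem_bw_gbs;
    return t_compute > t_memory ? t_compute : t_memory;
}

int main(void)
{
    params_t candidates[] = {
        { 32, 256, 2.2,  240.0, 130.0 },
        { 48, 512, 2.0, 1024.0, 160.0 },
        { 64, 512, 1.8,  512.0, 210.0 },
    };
    /* per-kernel work {GFLOP, GB moved}: stand-ins for profiled kernels */
    double kernels[3][2] = { {400.0, 30.0}, {80.0, 120.0}, {250.0, 60.0} };

    double power_budget = 200.0, best_t = 1e30;
    int best = -1;

    for (int i = 0; i < 3; i++) {
        if (candidates[i].power_w > power_budget) continue;   /* step 4: constraint  */
        double t = 0.0;
        for (int k = 0; k < 3; k++)                           /* step 3: estimation  */
            t += estimate_kernel_time(candidates[i], kernels[k][0], kernels[k][1]);
        if (t < best_t) { best_t = t; best = i; }
    }
    printf("best candidate: %d, projected time %.3f s\n", best, best_t);
    return 0;
}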

Identify kernel loops
- Our performance estimation tool estimates performance according to the "throughput" of each block in the core, following the Fujitsu micro-architecture.
- Identify kernels in each application benchmark and break the benchmark down into a set of kernels, in order to estimate the execution time of each kernel.
- (Figure: each kernel loop nest, e.g. do I = 1, N; do J = 1, N1; do K = 1, ..., produces a profile (profile1-3); the performance estimation tool turns each profile into an estimated execution time (exec T1-T3), and the total execution time is their sum.)

Extract Kernels
- Extract kernels from each application benchmark for more detailed analysis.
- This set of kernels is useful for evaluating performance on the simulator, which takes a long time to execute.
- (Figure: each extracted kernel loop is fed either to the simulator or, via its profile, to the performance estimation tool; either path yields the kernel's execution time (exec T1-T3).)

A64FX: Optimized Load Efficiency for Application Performance
- 128 bytes/cycle sustained L1D bandwidth, even for unaligned SIMD loads (two read ports, each delivering 64 B/cycle of read data).
- Suggested through the co-design work with the application teams.
- "Combined Gather" doubles the data throughput of gather (indirect) loads when the target elements for a pair of two registers (even & odd) are within a "128-byte aligned block", maximizing the bandwidth to 32 bytes/cycle.
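As a purely illustrative picture of the access pattern that benefits, the sketch below shows an indirect load loop; the function and array names are invented, and the Combined Gather mechanism itself is transparent hardware behaviour rather than something the code requests.

/* Illustrative sketch only: a gather (indirect) access pattern of the
 * kind the "Combined Gather" feature targets. What matters is that the
 * indices used by neighbouring SIMD elements fall inside the same
 * 128-byte aligned block (16 consecutive doubles).                     */
void gather_sum(int n, const double *restrict table,
                const int *restrict idx, double *restrict out)
{
    for (int i = 0; i < n; i++) {
        /* when idx[i] values for a pair of vector registers stay within
         * one 128-byte block, the L1D can serve two gather elements per
         * access instead of one                                         */
        out[i] += table[idx[i]];
    }
}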

Low-Power Design & Power Management
- Leading-edge Si technology (7nm FinFET), low-power logic design (15 GF/W @ dgemm).
- A64FX provides a power management function called the "Power Knob":
  - FL pipeline usage: FLA only; EX pipeline usage: EXA only; frequency reduction.
  - User programs can change the "Power Knob" for power optimization.
  - "Eco-mode": FLA only, with lower "stand-by" power for the ALUs; reduces the power consumption for memory-intensive apps.
  - 4 out of the 9 target applications select "eco-mode" for maximum performance under the limit of our power capacity (even though they use HBM2!).
- An "energy monitor" facility enables chip-level power monitoring and detailed power analysis of applications.
- Retention mode: a power state that de-activates the CPU while keeping the network alive; gives a large reduction of system power consumption at idle time.

Co-design of the network
- We selected the "Tofu" network for performance compatibility in large-scale applications.
- A link speed of 28 Gbps x 4 was selected due to technology availability around 2019.
- Communication patterns were extracted, and the communication performance was estimated by an "analytical model":
  - Many target applications have a neighbor communication pattern, or communication to nearby nodes: "Tofu" and the 28 Gbps links were sufficient.
  - Some apps have all-to-all communication. We studied the benefits and feasibility of an additional "dedicated" all-to-all network, but it was not selected, due to cost.
- Reduction over 3 DP words is supported by the Tofu TBI (Tofu Barrier Interface, as in K) for the QCD apps.
- The "common" programming model will be to run one MPI process per NUMA node (CMG), with OpenMP-MPI hybrid programming (see the sketch below); 48-thread OpenMP is also supported.
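A minimal sketch of this programming model, assuming a generic MPI + OpenMP code (none of it taken from an actual Fugaku application): one MPI rank per CMG, OpenMP threads inside the rank, and nearest-neighbor exchange between ranks, which maps well onto the Tofu 6D mesh/torus.

/* Minimal MPI + OpenMP hybrid sketch: one rank per CMG, threads inside
 * the rank, and a nearest-neighbor exchange between adjacent ranks.     */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 1 << 20;
    double *a = malloc(n * sizeof(double));
    double halo_send = 0.0, halo_recv = 0.0;

    /* thread-parallel compute inside one CMG (e.g. 12 threads per rank) */
    #pragma omp parallel for reduction(+:halo_send)
    for (int i = 0; i < n; i++) {
        a[i] = (double)(i + rank);
        halo_send += a[i] * 1e-9;
    }

    /* neighbor (nearest-rank) exchange, the dominant pattern on Tofu */
    int up = (rank + 1) % size, down = (rank - 1 + size) % size;
    MPI_Sendrecv(&halo_send, 1, MPI_DOUBLE, up,   0,
                 &halo_recv, 1, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(a);
    MPI_Finalize();
    return 0;
}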

TSMC 7nm FinFET and CoWoS technologies for HBM2

CPU: A64FX
- Architecture: Armv8.2-A SVE (512-bit SIMD)
- Cores: 48 cores for compute and 2/4 for OS activities; 4 NUMA nodes (CMGs)
- Peak performance (48 compute cores):
  - Normal 2.0 GHz: DP 3.072 TF, SP 6.144 TF, HP 12.288 TF
  - Boost 2.2 GHz: DP 3.3792 TF, SP 6.7584 TF, HP 13.5168 TF
- Cache L1: 64 KiB, 4-way; 230 GB/s (load), 115 GB/s (store) per core
- Cache L2: 8 MiB, 16-way per CMG (NUMA node); node: 3.6 TB/s; per core: 115 GB/s (load), 57 GB/s (store)
- Memory: HBM2, 32 GiB, 1024 GB/s
- Interconnect: TofuD (28 Gbps x 2 lanes x 10 ports)
- I/O: PCIe Gen3 x 16 lanes
- Technology: 7nm FinFET
- Measured performance: STREAM triad 830 GB/s; dgemm 2.5 TF (90% efficiency)
ref. Toshio Yoshida, "Fujitsu High Performance CPU for the Post-K Computer," IEEE Hot Chips: A Symposium on High Performance Chips, San Jose, August 21, 2018. (Courtesy of FUJITSU LIMITED)
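As a quick sanity check of the DP peak figures quoted above, assuming 2 FMA-capable 512-bit SIMD pipelines per core (i.e. 8 DP lanes x 2 flops x 2 pipes = 32 DP flop/cycle/core):

\[
  48~\text{cores} \times 2.0~\text{GHz} \times 32~\tfrac{\text{flop}}{\text{cycle}}
  = 3.072~\text{TFLOPS (DP, normal)},\qquad
  48 \times 2.2 \times 32 = 3.3792~\text{TFLOPS (DP, boost)}.
\]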

TofuD Interconnect
- TNR (Tofu Network Router): 2 lanes x 10 ports
- TNI: Tofu Network Interface (RDMA engine); 6 RDMA engines (TNI0-TNI5), 40.8 GB/s injection bandwidth (6.8 GB/s x 6)
- Hardware barrier support
- Network operation offloading capability
- 8 B Put latency: 0.49 - 0.54 usec
- 1 MiB Put throughput: 6.35 GB/s
ref. Yuichiro Ajima, et al., "The Tofu Interconnect D," IEEE Cluster 2018, 2018.


Fugaku prototype board and rack
- CMU (CPU Memory Unit): 2 CPUs per CMU; each 60 mm x 60 mm CPU package with on-package HBM2; water cooled; electrical signals on the board, AOC QSFP28 for the X, Y, and Z links
- Shelf: 48 CPUs (24 CMUs)
- Rack: 8 shelves = 384 CPUs (8 x 48)

Fugaku System Configuration
- 150k nodes; boost mode: 3.3792 TF x 150k ≈ 500 PF
- Two types of nodes: Compute Node and Compute & I/O Node, connected by the Fujitsu TofuD 6D mesh/torus interconnect
- 3-level hierarchical storage system:
  - 1st layer: one in 16 compute nodes, called the Compute & Storage I/O Node, has an SSD of about 1.6 TB. Services: cache for the global file system and temporary file systems (a local file system for a compute node, a shared file system for a job)
  - 2nd layer: Fujitsu FEFS, a Lustre-based global file system
  - 3rd layer: cloud storage services

Advances from the K computer

                              K computer    Fugaku     ratio
# cores                       8             48
Si tech. (nm)                 45            7
Core perf. (GFLOPS)           16            64         4
Chip (node) perf. (TFLOPS)    0.128         3.0        24
Memory BW (GB/s)              64            1024
B/F (Bytes/FLOP)              0.5           0.4
# nodes / rack                96            384        4
Rack perf. (TFLOPS)           12.3          1179.6     96
# nodes / system              82,944        150,000
System perf. (DP PFLOPS)      10.6          460.8      43

Boost mode: 3.3792 TF x 150k ≈ 500 PF
- SVE increases the core performance.
- Silicon technology and the scalable architecture (CMG) increase the node performance.
- HBM enables the high bandwidth.
More than 7.5 M general-purpose cores!
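A quick worked check of the system-level figures, using the round 150k-node count from the slide:

\[
  0.128~\text{TF} \times 82{,}944 \approx 10.6~\text{PF (K)},\qquad
  3.072~\text{TF} \times 150{,}000 = 460.8~\text{PF (Fugaku, normal)},
\]
\[
  3.3792~\text{TF} \times 150{,}000 \approx 507~\text{PF (Fugaku, boost)} \approx 500~\text{PF}.
\]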

Co-design in HPC (revisited)
Richard F. Barrett, et al., "On the Role of Co-design in High Performance Computing", Transition of HPC Towards Exascale Computing.

Co-design of Apps for Architecture (at an early stage)
Tools for performance tuning:
- Performance estimation tool: performance projection using a Fujitsu FX100 execution profile; gives the "target" performance.
- Post-K processor simulator: based on gem5, out-of-order (O3), cycle-level simulation; very slow, so limited to kernel-level evaluation.
Co-design of apps:
1. Estimate the "target" performance using the performance estimation tool.
2. Extract kernel code for the simulator.
3. Measure the execution time using the simulator.
4. Feed back to code optimization.
5. Feed back to the compiler.
(Execution time moves from "as is" through tuning 1, tuning 2, ... toward the target performance.)
Now the test chips of A64FX are available! We are evaluating the performance and doing performance tuning using the test chip.

Benchmark Results on the A64FX Test Chip
- CloverLeaf (UK Mini-App Consortium), Fortran/C: a hydrodynamics mini-app that solves the compressible Euler equations in 2D using an explicit, second-order method; stencil calculation.
- TeaLeaf (UK Mini-App Consortium), Fortran: a mini-application to enable design-space explorations for iterative sparse linear solvers; https://github.com/UK-MAC/TeaLeaf_ref.git; problem size: Benchmarks/tea_bm_5.in, end step 10.
- LULESH (LLNL), C: a mini-app representative of simplified 3D Lagrangian hydrodynamics on an unstructured mesh; indirect memory access.

Benchmark Results on the A64FX Test Chip
Disclaimer: the software used for the evaluation, such as the compiler, is still under development, and its performance may be different when the supercomputer Fugaku starts its operation.
Platforms:
- A64FX test chip (2.0 GHz), Fujitsu compiler
- ThunderX2 @ Apollo 70, 28C/2S @ 2.0 GHz, Arm HPC compiler 19.1
- Broadwell (Xeon E5-2680 v4), 14C/2S @ 2.4 GHz, Intel compiler 2019.0.045
- Skylake (Xeon Gold 6126) @ Cygnus, Univ. of Tsukuba, 12C/2S @ 2.6 GHz, Intel compiler 19.0.3.199
Compiler options:
- Fujitsu compiler: -Kfast,openmp
- Arm HPC compiler: -Ofast -march=armv8-a(+sve)
- Intel compiler: -O3 -qopenmp -march=native

CloverLeaf
(Charts: execution time and relative performance, normalized to 1 thread on A64FX, for 1, 4, 8, and 12 threads on A64FX, ThunderX2 (TX2), and Broadwell.)
- Evaluation using one CMG (NUMA node), without MPI.
- Good scalability when increasing the number of threads within a CMG.
- One CMG's performance is comparable to the Intel one (and the chip contains 4 CMGs!).

TeaLeaf
(Charts: wall-clock time and relative performance, normalized to 1 process / 1 thread on A64FX, for A64FX, TX2, and Xeon; the Xeon is a Xeon Gold 6126 @ Cygnus, Univ. of Tsukuba, 2.6 GHz, 12 cores x 2 sockets; the number of MPI processes and threads is varied.)
- Evaluation of the MPI program within one chip (up to 4 MPI processes), changing the number of threads within a CMG.
- The speedup is limited for more than 4 threads, probably due to the memory bandwidth(?); we need more analysis.

LULESH
(Charts: FOM for 1, 4, 8, and 12 threads on A64FX, TX2, and Broadwell.)
- Evaluation using one CMG (NUMA node), without MPI.
- One CMG's performance is lower than that of ThunderX2 and the Intel one.
- We found low vectorization (the SIMD (SVE) instruction ratio is only a few %).
- We need more code tuning for more vectorization using SIMD (see the sketch below).
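As a purely illustrative example of this kind of tuning (not LULESH source code; the function and arrays are invented), a loop with indirect loads often needs an explicit SIMD hint before the compiler will emit SVE gather instructions:

/* Illustrative sketch only: the kind of restructuring hinted at above.
 * The pragma asserts the loop is safe to vectorize, letting the compiler
 * use SVE gathers for the indirect read.                                */
void accumulate_nodal_force(int nelem, const int *restrict nodelist,
                            const double *restrict elem_force,
                            double *restrict node_force)
{
    /* with Fujitsu or Arm compilers, 'omp simd' (or a vendor loop
       directive) is a common way to express the independence           */
    #pragma omp simd
    for (int e = 0; e < nelem; e++) {
        node_force[e] += elem_force[nodelist[e]];
    }
}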


