Performance of OpenFOAM@Fugaku

Transcription

Performance of OpenFOAM@Fugaku
Research Organization for Information Science and Technology (RIST)
Sep. 28th, 2021

Benchmark test
HPC Committee benchmark: https://develop.openfoam.com/committees/hpc [1]
HPC motorbike: S (8.6M cells), M (17.2M cells), L (34M cells, unmeasured)
Fig. 1: HPC motorbike image. [2]
Lid driven cavity-3D: S (1M cells), M (8M cells), XL (64M cells), XXL (216M cells)
Fig. 2: Lid driven cavity-3D image. [3]

Overall information
OpenFOAM version: performance was measured with the following version, built according to the installation procedures created by RIST, not using the Spack recipe.
- v2012 (as-is), OpenCFD Ltd
Hardware: Supercomputer Fugaku (A64FX)
Fugaku pre/post environment:
- Large memory node: Intel Xeon Platinum 8280L (2.70 GHz / 28 cores) x4
- GPU node: Intel Xeon Gold 6240 (2.60 GHz / 18 cores) x2
Language environment: tcsds-1.2.31 (fj4.5.0)
Compiler options: -Nclang -std=c++11 -m64 -march=armv8.2-a+sve -mcpu=a64fx -ffp-contract=fast -ffast-math -O3 -funsafe-math-optimizations
Measurement method: real time reported by the time command, e.g. time mpiexec -np 48 icoFoam
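To make the measurement method concrete, the following is a minimal sketch of a Fugaku batch script that reproduces the timing command above; the #PJM resource values, installation path, solver and process count are placeholders for illustration, not the settings actually used for these measurements.

    #!/bin/bash
    #PJM -L "node=1"                     # number of Fugaku nodes (placeholder)
    #PJM -L "elapse=01:00:00"            # wall-clock limit (placeholder)
    #PJM --mpi "max-proc-per-node=48"    # 48 MPI processes per node

    # Placeholder path: source the RIST-built OpenFOAM v2012 environment.
    . /path/to/OpenFOAM-v2012/etc/bashrc

    # The "real" line of the time output is the elapsed time reported in the slides.
    time mpiexec -np 48 icoFoam -parallel > log.icoFoam 2>&1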

HPC motorbike

HPC motorbike/Small: simpleFoam in Allrun @ Fugaku
Parallel efficiency of 50% or more is maintained up to 1,024 MPI processes.
32p/n indicates that 32 MPI processes per node were executed; 48p/n is faster up to 32 nodes, but 32p/n is faster for 64 nodes and above.
Graph omitted: execution time by # of MPI processes / # of nodes, up to 6,144 processes on 128 nodes.
Although not shown in the graph above, single-core execution took T(1) = 49,797 seconds (13h 49m 57s).
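For reference, the parallel efficiency quoted here can be read with the usual strong-scaling definition E(N) = T(1) / (N x T(N)). With T(1) = 49,797 s, keeping E(1024) >= 0.5 therefore means the 1,024-process run must finish in T(1024) <= 49,797 / (1024 x 0.5), i.e. roughly 97 s of elapsed time.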

HPC motorbike/Medium: simpleFoam in Allrun @ Fugaku
Parallel efficiency of 50% or more is maintained up to 1,024 MPI processes.
32p/n indicates that 32 MPI processes per node were executed; 48p/n is faster up to 32 nodes, but 32p/n is faster for 64 nodes and above.
Graph omitted: execution time by # of MPI processes / # of nodes, up to 6,144 processes on 128 nodes.
Although not shown in the graph above, single-core execution took T(1) = 102,401 seconds (28h 26m 41s).

HPC motorbike/Small: Pre-processing (AllmeshS)
Before running snappyHexMesh, blockMesh and decomposePar finish in a few seconds, so they are almost invisible on the graph.
Measured on the large memory node (Intel Xeon Platinum, 6 TB memory), the GPU node (Xeon Gold, GPU unused) and Fugaku; it runs fastest on the GPU node.
Only snappyHexMesh is executed in parallel; the other modules are executed in serial.
We recommend processing AllmeshS on the pre/post node (Xeon).
Fig.: original shell image of AllmeshS.
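For orientation, a minimal meshing script in the spirit of AllmeshS is sketched below using the standard tutorial RunFunctions helpers; the actual AllmeshS in the benchmark repository [1],[2] may use different steps and options, so treat this only as an illustration of the serial/parallel split described above.

    #!/bin/sh
    # Source the OpenFOAM tutorial run functions (runApplication/runParallel).
    . $WM_PROJECT_DIR/bin/tools/RunFunctions

    runApplication blockMesh                       # background mesh: finishes in seconds
    runApplication decomposePar                    # serial decomposition for parallel meshing
    runParallel snappyHexMesh -overwrite           # the only step executed in parallel
    runApplication reconstructParMesh -constant    # merge the mesh back for later runs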

HPC motorbike/Small: Pre-processing (Allrun)
The decomposePar in Allrun is executed again: it runs serially to decompose the polyMesh according to the number of parallel processes used by simpleFoam.
As the number of subdomains to be decomposed increases, the difference in processing time between Xeon and Fugaku decreases.
LLIO (/share) was used for execution on Fugaku.
Original shell image of Allrun:

    #!/bin/sh
    # Source tutorial run functions
    . $WM_PROJECT_DIR/bin/tools/RunFunctions
    rm log.decomposePar log.simpleFoam log.potentialFoam
    ./Allclean
    cp -r 0.org 0
    runApplication decomposePar
    runParallel potentialFoam -writephi
    runParallel $(getApplication)       # getApplication returns simpleFoam
    mkdir FIGS
    foamLog log.$(getApplication)
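The number of subdomains that this decomposePar step produces is read from system/decomposeParDict; a minimal sketch for a 48-process simpleFoam run is shown below. The decomposition method actually used by the benchmark case is not stated on this slide, so scotch here is an assumption.

    FoamFile
    {
        version     2.0;
        format      ascii;
        class       dictionary;
        object      decomposeParDict;
    }

    numberOfSubdomains  48;       // one subdomain per MPI process of simpleFoam
    method              scotch;   // assumed; hierarchical or simple are common alternatives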

Lid driven cavity-3D

Lid driven Cavity-3D (Test cases)
S (1M cells) and M (8M cells), 2 cases:
- fixedNORM: exit norm is fixed; tolerance 1e-04, relTol 0, maxIter 3000, DIC-PCG
- fixedITER: computational load is fixed; tolerance 0, relTol 0, maxIter 250, DIC-PCG
XL (64M cells) and XXL (216M cells): fixedNORM only; tolerance 1e-04, maxIter 3000
Solvers compared:
- FOAM-DIC-PCG: diagonal-based incomplete-Cholesky preconditioned Conjugate Gradient
- PETSc-ICC-CG: incomplete-Cholesky Conjugate Gradient
- PETSc-AMG-CG: classic algebraic multigrid (BoomerAMG) Conjugate Gradient
- PETSc-AMG-CG.caching: the same solver as above, with caching "never" specified; the storing of the matrix's coefficients is done once at the beginning of the time-dependent simulation. [4]
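As an illustration of the two stopping criteria with the DIC-PCG solver, a system/fvSolution pressure-solver entry might look like the sketch below; the entry names follow standard OpenFOAM syntax, but the exact dictionaries used by the benchmark should be taken from the repository [1].

    // fixedNORM: iterate until the residual norm reaches the fixed tolerance
    p
    {
        solver          PCG;
        preconditioner  DIC;
        tolerance       1e-04;
        relTol          0;
        maxIter         3000;
    }

    // fixedITER: the computational load is fixed instead, by capping the
    // iteration count and disabling the norm-based exit:
    //     tolerance 0;  relTol 0;  maxIter 250;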

Lid driven Cavity-3D (S): 1M cells
1,300 cells/core at 768-core execution.
Scalability has been obtained from 12 cores to 768 cores.

Lid driven Cavity-3D (M): 8M cells
650 cells/core at 12,288-core execution.
Scalability has been obtained from 192 cores to 12K cores.

Lid driven Cavity-3D (XL) [4],[5]: 64M cells
7K cells/core at 9,216-core execution.
PETSc-AMG-CG.caching gives the best performance up to 9.2K cores.
Scalability has been obtained with FOAM-DIC-PCG and PETSc-ICC-CG, but not with the AMG series.
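For the PETSc-based variants, the PETSc4FOAM library [4],[5] lets fvSolution delegate the pressure solve to PETSc. The fragment below is a rough sketch of that selection; the options/caching sub-dictionary names are recalled from [4] and may differ in the release used here, so treat them as assumptions rather than the exact benchmark settings.

    p
    {
        solver          petsc;              // provided by the PETSc4FOAM plug-in [4]
        petsc
        {
            options
            {
                ksp_type        cg;         // Conjugate Gradient
                pc_type         hypre;      // BoomerAMG preconditioner (AMG-CG case)
                pc_hypre_type   boomeramg;
                // pc_type icc;             // would give the ICC-CG variant instead
            }
            caching
            {
                matrix  { update never; }   // ".caching" case: coefficients stored once
            }
        }
        tolerance       1e-04;
        relTol          0;
    }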

Lid driven Cavity-3D (XXL) [4],[5]: 216M cells
endTime and deltaT in system/controlDict were modified (both reduced by a factor of 4, so the run still covers 100 time steps):

    diff -r XXL/system/controlDict XXL.new/system/controlDict
    < endTime     0.00625;
    ---
    > endTime     0.0015625;
    30c30
    < deltaT      0.0000625;
    ---
    > deltaT      0.000015625;

23K cells/core at 9,216-core execution.
PETSc-AMG-CG.caching gives the best performance up to 9.2K cores.
Scalability of FOAM-DIC-PCG and PETSc-ICC-CG is better than the AMG series at 9,216 cores.

Lid driven Cavity-3D (M): Pre-processing
decomposePar run time increases as the number of subdomains increases.
The large memory (pre/post) node is 3 to 4 times faster than Fugaku.
We recommend running the pre-processing (blockMesh & decomposePar) on the pre/post node (Xeon).

References
[1] HPC Committee benchmark, https://develop.openfoam.com/committees/hpc
[2] HPC motorbike assets: /develop/HPC motorbike/assets
[3] HPC Benchmark Project, 8th OpenFOAM Conference 2020, https://wiki.openfoam.com/images/c/c1/OpenFOAM_2020_CINECA_Spisso.pdf
[4] PETSc4FOAM: A Library to plug-in PETSc into the OpenFOAM Framework (…-Framework.pdf)
[5] PETScFOAM er
