DL4Fugaku: AI Frameworks On Fugaku - University Of Tennessee

Transcription

DL4Fugaku: AI frameworks on Fugaku
BOS: Challenges and opportunities with running AI workloads on HPC systems
11th JLESC Workshop, September 9th, 2020
Kento Sato, High Performance Big Data Research Team, RIKEN Center for Computational Science (RIKEN R-CCS)
※ Some of the software introduced in these slides is under development; experimental results may change in the course of tuning.

Supercomputer Fugaku & deep learning

- Large-scale deep learning is emerging as an essential machine learning approach for many research challenges such as image recognition, segmentation and natural language processing.
- Fast and scalable large-scale deep learning enables us to train neural networks with more training data in a shorter time.
- Fugaku/A64FX is expected to achieve high-performance DNN training/inference.
- GPUs have become a popular platform for executing DL, but we revisit the idea of running DL on CPUs in large-scale environments.

A64FX summary (CMG: Core Memory Group, NOC: Network on Chip; each CMG has 12 compute cores and 1 assistant core):
- Arm SVE, high performance and high efficiency; high-performance FP16/INT8
- ISA (base, extension): Armv8.2-A, SVE
- SIMD width: 512-bit
- Number of cores: 48
- Peak DP performance: 2.7 TFLOPS (90% of peak on DGEMM)
- Memory capacity: 32 GiB (HBM2 ×4) → high-bandwidth memory
- Memory peak bandwidth: 1024 GB/s (80% of peak on STREAM Triad)
- Interconnect: TofuD integrated → scalable TofuD interconnect
- PCIe: Gen3, 16 lanes
- Process technology: 7 nm

Source: Toshiyuki Shimizu, "Post-K Supercomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA", Fujitsu, SCAsia 2019, March 12

To make use of Fugaku/A64FX performance, tuning the AI software stack is indispensable.
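The peak-DP figure on the slide can be sanity-checked from the SIMD parameters it lists. The sketch below assumes two details not on the slide: 2 FMA pipelines per core and a 1.8 GHz base clock (both from early public A64FX announcements); the benchmark nodes later in the talk run at 2.2 GHz, which gives a correspondingly higher peak.

```python
# Back-of-the-envelope check of the A64FX peak-DP figure.
# Assumed (not stated on the slide): 2 FMA pipelines per core, 1.8 GHz base clock.

cores = 48
simd_bits = 512
dp_lanes = simd_bits // 64                   # 8 double-precision lanes per SVE op
fma_pipes = 2                                # assumption: 2 FMA pipelines per core
flops_per_cycle = dp_lanes * fma_pipes * 2   # an FMA counts as 2 FLOPs

def peak_tflops(clock_ghz):
    """Peak double-precision TFLOPS for the whole chip at a given clock."""
    return cores * flops_per_cycle * clock_ghz / 1000.0

print(round(peak_tflops(1.8), 2))  # ~2.76 TFLOPS, matching the slide's 2.7
print(round(peak_tflops(2.2), 2))  # ~3.38 TFLOPS at the 2.2 GHz evaluation clock
```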

DL4Fugaku: Deep learning for Fugaku

- Objective: fast and scalable deep learning on Fugaku/A64FX
  - Conduct porting, performance analysis and tuning
  - Deploy a large-scale deep learning environment
  - Enhance usability for production use on Fugaku
- MOU for RIKEN/Fujitsu collaboration on AI framework development for Fugaku (signed Nov. 25th, 2019; photo — left: Satoshi Matsuoka, R-CCS Director; right: Naoki Shinjo, Head of Platform Development Unit, Fujitsu Limited)
- RIKEN R-CCS internal teams are working together: Operation Team, Application Tuning Development Unit, High Performance AI System Research Team, High Performance Big Data Research Team, Large-scale Parallel Numerical Computing Technology Research Team
- Under collaboration with industry & academia (alphabetical order): AIST, ARM, Cybozu, Fujitsu Laboratories, Fujitsu Limited, Linaro, Tokyo Tech
  - Porting, tracing DL, performance analysis, tuning, merging to upstream

Survey on DL framework usage in Japan

- Period: Oct. 2019 to Nov. 2019
- Target organizations and users:
  - RIKEN R-CCS
  - RIKEN AIP
  - Users from the HPCI Strategic Program
  - Users of ABCI at AIST
  → Potential Fugaku users who use DL frameworks answered this survey
- (Pie chart of framework shares; legible figures include TensorFlow 26% and Keras 16%. ※ "Other" users develop and use their own DL frameworks.)
- The popular DL frameworks are TensorFlow, PyTorch and Chainer
  ⇒ We plan to support these three frameworks on Fugaku

Porting and tuning approach

- Deep learning software stack
  - Deep learning frameworks rely on low-level numerical libraries optimized for specific hardware: cuDNN for NVIDIA GPUs, oneDNN for Intel CPUs — and ? for A64FX
- Approach
  - We decided to tune oneDNN for Fugaku's A64FX CPUs (oneDNN aarch64) instead of developing from scratch
- Current status
  - Most of the porting and tuning is finished
  - The source code is in a GitHub repository: https://github.com/fujitsu/dnnl_aarch64
  - We also contribute to the upstream oneDNN repository

(Stack diagram: DL frameworks (TensorFlow, PyTorch, Chainer, etc.) sit on low-level libraries for Intel CPU / A64FX. Library naming history: Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) → Deep Neural Network Library (DNNL) → oneAPI Deep Neural Network Library (oneDNN).)

Slide courtesy of Jin Takahashi, Fujitsu Laboratories Ltd., with translation and modifications
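To see why the low-level library layer is where the tuning effort goes: convolution dominates DNN runtime, and libraries such as cuDNN and oneDNN typically lower it to a highly tuned matrix multiply per architecture (SVE kernels on A64FX, AVX-512 on Intel, tensor cores on NVIDIA). A minimal pure-Python illustration of that im2col-plus-GEMM lowering (purely didactic, no oneDNN API involved):

```python
# Why frameworks delegate to primitive libraries: a convolution can be
# rewritten as a matrix product, and the matrix-product kernel is the
# per-architecture tuned part. Illustrative sketch on nested lists.

def conv2d_direct(img, ker):
    """Naive valid 2-D convolution (cross-correlation)."""
    H, W = len(img), len(img[0])
    kh, kw = len(ker), len(ker[0])
    return [[sum(img[i + di][j + dj] * ker[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(W - kw + 1)]
            for i in range(H - kh + 1)]

def conv2d_im2col(img, ker):
    """Same result via im2col: each output pixel becomes one dot product."""
    H, W = len(img), len(img[0])
    kh, kw = len(ker), len(ker[0])
    # Unfold every receptive field into a row of a matrix ("im2col").
    cols = [[img[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            for i in range(H - kh + 1) for j in range(W - kw + 1)]
    kvec = [ker[di][dj] for di in range(kh) for dj in range(kw)]
    # This dot-product loop is the GEMM-like kernel a library would tune.
    flat = [sum(a * b for a, b in zip(row, kvec)) for row in cols]
    ow = W - kw + 1
    return [flat[r * ow:(r + 1) * ow] for r in range(len(flat) // ow)]

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ker = [[1, 0], [0, -1]]
assert conv2d_direct(img, ker) == conv2d_im2col(img, ker)
print(conv2d_direct(img, ker))  # → [[-4, -4], [-4, -4]]
```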

Performance evaluation: ResNet-50 on A64FX (a single node)

- Environment
  - HW: A64FX (2.2 GHz, 48 cores, HBM2 32 GB)
  - SW: Fujitsu compiler (fcc), Fujitsu numerical libraries (SSL-II)
- (Bar charts of throughput in images per second [ips] for TensorFlow v2.1.0 and PyTorch v1.5.0, inference (FP32) and training (FP32), comparing a baseline against oneDNN aarch64 and oneDNN aarch64 + tcmalloc; reported speedups range from ×1.0 up to ×9.2, with per-configuration batch sizes noted on the slide, e.g. 36×4, 75×4, 512×4, 61×4, 128×4)
- Reference: NVIDIA GPU V100: 905 ips [1]

[1] NVIDIA Data Center Deep Learning Product Performance (training/inference)

Slide courtesy of Jin Takahashi, Fujitsu Laboratories Ltd., with translation and modifications
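For readers unfamiliar with the [ips] metric on these charts: throughput is processed images divided by wall time, and the ×N labels are ratios of tuned to baseline throughput. The elapsed times below are invented placeholders chosen only to illustrate the arithmetic; they are not the slide's measurements (the 36-image batch size does appear on the slide).

```python
# How the ips and speedup figures on the chart are derived.
# Elapsed times here are made-up placeholders, not measured values.

def images_per_sec(batch_size, iterations, elapsed_sec):
    """Throughput in images per second (the [ips] axis on the slide)."""
    return batch_size * iterations / elapsed_sec

def speedup(tuned_ips, baseline_ips):
    """The xN labels: ratio of tuned to baseline throughput."""
    return tuned_ips / baseline_ips

base = images_per_sec(batch_size=36, iterations=100, elapsed_sec=144.0)
tuned = images_per_sec(batch_size=36, iterations=100, elapsed_sec=45.0)
print(round(base, 1), round(tuned, 1), round(speedup(tuned, base), 1))
# → 25.0 80.0 3.2
```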

Performance evaluation: ResNet-50 on A64FX (multi-node)

- Environment
  - HW: A64FX (2.2 GHz, 48 cores, HBM2 32 GB), TofuD interconnect
  - SW: Fujitsu compiler (fcc), Fujitsu numerical libraries (SSL-II), Horovod
- ResNet-50 training (FP32) scaling, throughput [ips] vs. number of nodes (1 to 8):
  - PyTorch: perf. (8 nodes) / perf. (1 node) = 781.5 / 98.2 ≈ ×7.96
  - TensorFlow: perf. (8 nodes) / perf. (1 node) = 651.5 / 85.6 ≈ ×7.61
- We are now benchmarking at larger scale and on other NNs; more results will be released soon

Slide courtesy of Jin Takahashi, Fujitsu Laboratories Ltd., with translation and modifications
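The 8-node speedups quoted on the slide translate directly into parallel efficiency (speedup divided by node count). A small sketch using only the throughput numbers the slide reports:

```python
# Parallel efficiency of the 8-node ResNet-50 training runs,
# computed from the throughput (ips) figures reported on the slide.

def efficiency(multi_ips, single_ips, nodes):
    """Achieved throughput relative to perfect linear scaling."""
    return multi_ips / (single_ips * nodes)

pytorch = efficiency(781.5, 98.2, 8)     # speedup ~7.96 on 8 nodes
tensorflow = efficiency(651.5, 85.6, 8)  # speedup ~7.61 on 8 nodes
print(round(pytorch, 3), round(tensorflow, 3))  # → 0.995 0.951
```

Both runs stay above 95% efficiency at 8 nodes, i.e. near-linear scaling over TofuD with Horovod at this scale.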
