Deploying A Task-based Runtime System On Raspberry Pi Clusters

Transcription

Deploying a Task-based Runtime System on Raspberry Pi Clusters
Patrick Diehl and Steven R. Brandt
University of Louisiana
{pdiehl,sbrandt}@cct.lsu.edu
Extreme Scale Programming Models and Middleware (ESPM2'20)
Patrick Diehl and Steven R. Brandt (LSU), AMT on Raspberry Pi, November 11, 2020

Motivation
Arm technology is emerging in supercomputers and data centers, e.g. Fugaku, the fastest supercomputer in the Top500.
Low power consumption.
Low cost of a Raspberry Pi and of building a small cluster.
One cluster of 4 nodes costs around $200.

Outline
1 Tools
2 Hardware and Software
3 Benchmarks
4 Results
  Memory
  Computation time
  Energy consumption
5 Conclusion & Outlook

Tools

HPX
The C++ Standard Library for Concurrency and Parallelism
Figure: HPX architecture, with the components HPX Application, Performance Monitoring, Thread Scheduling, Active Global Address Space, Parcel Transport Layer, Local Control Objects, and Operating System.
HPX's lightweight user threads reduce context-switching overhead.
The Active Global Address Space (AGAS) provides a unified view of the application.
The parcel layer overlaps communication and computation.
Reference
Kaiser et al. (2020). HPX - The C++ Standard Library for Parallelism and Concurrency. Journal of Open Source Software, 5(53), 2352, https://doi.org/10.21105/joss.02352

Phylanx
An Asynchronous Distributed Array Computing Toolkit
Runs Python code in parallel within the HPX runtime system.
Python is a commonly used language in machine learning and deep learning.
Reference
Tohid, R., et al. "Asynchronous execution of Python code on task-based runtime systems." 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2). IEEE, 2018.

Hardware and Software

Raspberry Pi cluster & Software
Table: Specification/Architecture of the three nodes utilised in the benchmarks.
Model               Raspberry Pi 3B   Raspberry Pi 3B+   Raspberry Pi 4B
Micro-architecture  Arm v8-A          Arm v8-A           Arm v8
Processor Model     Cortex-A53        Cortex-A53         Cortex-A72
Number of CPUs      1                 1                  1
Cores per CPU       4                 4                  4
Total Cores         4                 4                  4
Frequency           1.2GHz            1.4GHz             1.5GHz
Memory              1GB               1GB                4GB
Table: Overview of the compilers, software, and operating system used.
Columns: Operating System, Compilers, hwloc, lapack, …
Operating System: Ubuntu 20.04 LTS for Arm; Compilers: gcc …

Benchmarks

2D Jacobi Solver (Shared memory)
2D stencil based on the Jacobi method using:
Standard grid layout for GCC's auto-vectorization (-O3)
Virtual Node Scheme [1] for explicit vectorization
Roofline model [2] to predict the optimal performance:
$P_{\text{optimal}} = \text{Memory Bandwidth} \times \text{AI}$
where the arithmetic intensity (AI) is 1/24 for double precision and 1/12 for single precision.
References
[1] P. Boyle, A. Yamaguchi, G. Cossu, and A. Portelli, "Grid: A next generation data parallel C++ QCD library," arXiv preprint arXiv:1512.03487, 2015.
[2] S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick, "The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures," in Hot Chips, vol. 20, 2008, pp. 24–26.
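The roofline estimate on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes the arithmetic intensity is counted in lattice updates per byte (so 24 bytes per update in double precision and 12 in single), and the bandwidth value in the usage note is a hypothetical number, not a measurement from the talk. The function and constant names are invented for this sketch.

```python
# Bytes of memory traffic per lattice update, i.e. 1 / AI from the slide.
# Assumption: AI = 1/24 (double) and 1/12 (single) are in updates per byte.
BYTES_PER_UPDATE = {"double": 24, "single": 12}


def expected_peak(memory_bandwidth: float, precision: str) -> float:
    """Roofline estimate: P_optimal = memory bandwidth * AI.

    With the bandwidth given in MB/s, the result is in millions of
    lattice updates per second (MLUPs/s).
    """
    return memory_bandwidth / BYTES_PER_UPDATE[precision]
```

For example, with a hypothetical bandwidth of 2400 MB/s, `expected_peak(2400.0, "double")` gives 100.0 MLUPs/s and `expected_peak(2400.0, "single")` gives 200.0 MLUPs/s, so halving the element size doubles the bandwidth-bound peak.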

1D Heat equation solver (Distributed memory)
Parameters: heat transfer coefficient k = 0.5, time step dt = 1, and grid spacing dx = …
Figure: Initial conditions
Figure: Solution
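As a rough illustration of what such a solver computes, the sketch below applies an explicit finite-difference update for the 1D heat equation on a single node. It is not the distributed benchmark code from the talk. The periodic boundary, the default `dx = 1.0`, and the function names are assumptions made for this example; `k` and `dt` take the values from the slide.

```python
def heat_step(u, k=0.5, dt=1.0, dx=1.0):
    """One explicit time step of the 1D heat equation.

    Assumptions for this sketch: periodic boundary and dx = 1.0.
    k and dt default to the values listed on the slide.
    """
    n = len(u)
    c = k * dt / (dx * dx)  # the explicit scheme is stable for c <= 0.5
    # u[i - 1] wraps around for i == 0, (i + 1) % n wraps for i == n - 1.
    return [u[i] + c * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]) for i in range(n)]


def solve(u0, steps, **params):
    """Apply heat_step repeatedly, starting from the initial condition u0."""
    u = list(u0)
    for _ in range(steps):
        u = heat_step(u, **params)
    return u
```

With periodic boundaries the sum of all grid values is conserved from step to step, which makes a convenient sanity check. In a distributed version, each node would hold a contiguous partition of `u` and exchange only its edge values with its neighbours.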

Results

STREAM TRIAD Benchmark
Figure: Memory bandwidth (in MB/s) versus core count (1 to 4) for the Rpi 3B, Rpi 3B+, and Rpi 4, measured with the STREAM TRIAD benchmark with an array size of 10M elements.
Pi 3B/3B+ have very low memory bandwidth (MB).
A single processing unit (PU) already saturates the bandwidth.
Pi 4 shows the same behavior, but with double the MB.
Conclusion: The memory bus can only handle a certain amount of MB and concurrency at the same time.
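For readers unfamiliar with the TRIAD kernel, the following sketch shows the operation it times and how a bandwidth figure is derived from it. It is written in pure Python for readability, so the number it reports reflects interpreter overhead rather than the memory bus. The function names and the scalar value are placeholders for this example.

```python
import time
from array import array


def triad(a, b, c, scalar):
    """STREAM TRIAD kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bandwidth_mb_per_s(n, scalar=3.0):
    """Run TRIAD once on n doubles and return (result array, MB/s)."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    start = time.perf_counter()
    triad(a, b, c, scalar)
    elapsed = time.perf_counter() - start
    # TRIAD touches three arrays per element: two reads and one write.
    bytes_moved = 3 * a.itemsize * n
    return a, bytes_moved / elapsed / 1e6
```

The real benchmark is compiled code that repeats the kernel several times and reports the best run; only the accounting of three arrays per element carries over from this sketch.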

2D stencil (Raspberry Pi 4)
Figure: 2D stencil (Raspberry Pi 4): Performance (in MLUPs/s) versus core count for single and double precision, comparing scalar (auto), vector (explicit), and the expected peak. Grid size of 4096 × 4096 iterated over 100 time steps.
Conclusion: Best performance on 2 cores.

2D stencil (Raspberry Pi 3B+)
Figure: 2D stencil (Raspberry Pi 3B+): Performance (in MLUPs/s) versus core count for single and double precision, comparing scalar (auto), vector (explicit), and the expected peak. Grid size of 4096 × 4096 iterated over 100 time steps.
Conclusion: We cannot achieve the expected peak performance.

2D stencil (Raspberry Pi 3B)
Figure: 2D stencil (Raspberry Pi 3B): Performance (in MLUPs/s) versus core count for single and double precision, comparing scalar (auto), vector (explicit), and the expected peak. Grid size of 4096 × 4096 iterated over 100 time steps.
Conclusion: The Pi 3B and 3B+ have similar performance because the two models differ only in clock speed.

Multi-Node Benchmark
Figure: Execution time in seconds for various node counts (1 to 4) using all 4 threads. Panels: Raspberry Pi 3B, Raspberry Pi 3B+, Raspberry Pi 4. Series: Np = 30M, Nt = 100; Np = 60M, Nt = 100; Np = 60M, Nt = 500.
Conclusion: Multi-node codes can scale well.

Multi-Node Benchmark
Figure: Execution time in seconds for various node counts (1 to 4) using 1, 2, 3, or 4 threads per node. Panels: Raspberry Pi 3B, Raspberry Pi 3B+, Raspberry Pi 4.
Conclusion: Additional threads provide little performance gain, and actually hurt on the Pi 3B and 3B+.

Phylanx: ALS Benchmark
Figure: Performance of the ALS benchmark: execution time in seconds for various core counts (1 to 4) on the Rpi 4, Rpi 3B, and Rpi 3B+.
Conclusion: Our ALS code is fastest on 2 cores, but probably needs more development.

Cost wrt Power Consumption
Figure: 1D stencil: Cost (in 1e-5 US $) with respect to power consumption, per model, for the 1D stencil code using 30 million stencil points per iteration and a total of 100 iterations.
Figure: ALS: Cost (in 1e-5 US $) with respect to power consumption, per model, for the Alternating Least Squares (ALS) benchmark in Phylanx for the MovieLens 20M dataset.
The power consumption for all models was obtained using the Linux command stress [3] on all four cores.
[3] https://linux.die.net/man/1/stress
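The slide does not state how the cost was computed. One plausible reading, sketched here purely as an assumption, is energy used during the run (power under load times execution time) multiplied by an electricity price. The function name and all numbers in the usage note are hypothetical and are not taken from the talk.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def energy_cost_usd(power_watts, runtime_seconds, usd_per_kwh):
    """Cost of one run: energy in kWh times the electricity price.

    Assumption: power_watts is the draw measured under full load,
    e.g. while running `stress` on all four cores.
    """
    energy_kwh = power_watts * runtime_seconds / JOULES_PER_KWH
    return energy_kwh * usd_per_kwh
```

As a hypothetical example, a board drawing 5 W for 72 s at 0.10 US $ per kWh uses 1e-4 kWh and costs about 1e-5 US $, which is the order of magnitude of the axis units in the figures.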

Conclusion & Outlook

Conclusion & Outlook
Conclusion
Limited memory bandwidth limits the utilization of all cores.
Ubuntu Server 2020 supports 32-bit and 64-bit.
The frequency setting "performance" was used instead of "ondemand".
The cluster provides modest performance at a reasonable cost.
Outlook
Use the small and affordable cluster for teaching parallel and distributed computing.
The interface for attaching sensors could be used in field studies to collect data, and Phylanx could process the data before uploading it to more powerful devices for analysis.
Use a larger cluster and more sophisticated Arm hardware.

This work is licensed under a Creative Commons "…NoDerivatives 4.0 International" license.
