MAGMA - ICL

Transcription

MAGMA
Matrix Algebra on GPU and Multicore Architectures
Innovative Computing Laboratory
Electrical Engineering and Computer Science
University of Tennessee
Piotr Luszczek (presenter)
web.eecs.utk.edu/~luszczek/conf/

MAGMA: LAPACK for GPUs

MAGMA
– Matrix Algebra on GPU and Multicore Architectures
– To provide LAPACK/ScaLAPACK on hybrid architectures
– http://icl.cs.utk.edu/magma/

MAGMA BLAS
– A subset of BLAS for GPUs
– Highly optimized for NVIDIA GPGPUs
– Fast GEMM for Fermi

MAGMA developers & collaborators
– UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi Arabia)
– Community effort, similar to LAPACK/ScaLAPACK
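Because MAGMA mirrors LAPACK's interfaces, calling it looks much like calling LAPACK. The following is a minimal usage sketch, assuming MAGMA's LAPACK-style CPU interface for the LU factorization (magma_dgetrf with dgetrf-style arguments); header names and exact signatures can vary between MAGMA releases, so treat this as illustrative rather than authoritative.

#include <stdlib.h>
#include <magma.h>   /* MAGMA header; the name may differ by version (e.g. magma_v2.h) */

int main(void)
{
    magma_int_t n = 4096, info = 0;
    magma_int_t *ipiv = malloc(n * sizeof(magma_int_t));
    double *A = malloc((size_t)n * n * sizeof(double));

    /* simple diagonally dominant test matrix, column-major as in LAPACK */
    for (magma_int_t j = 0; j < n; ++j)
        for (magma_int_t i = 0; i < n; ++i)
            A[i + j*n] = (i == j) ? 2.0 * n : 1.0;

    magma_init();                            /* start up MAGMA / the GPU           */
    magma_dgetrf(n, n, A, n, ipiv, &info);   /* hybrid CPU+GPU LU, LAPACK-style args */
    magma_finalize();

    free(A); free(ipiv);
    return (int)info;                        /* info == 0 on success, as in LAPACK */
}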

MAGMA Software Stack

[Slide: software stack diagram, spanning single-GPU, multi-GPU, and distributed systems, with CPU, hybrid, and GPU components]
– single GPU: hybrid LAPACK and tile kernels; MAGMA 1.3, MAGMA BLAS, MAGMA SPARSE, on top of LAPACK and BLAS
– multi-GPU: hybrid tile (PLASMA) algorithms; PLASMA / QUARK, MAGMA 1.3, StarPU run-time system
– distributed: hybrid LAPACK/ScaLAPACK & tile algorithms / MORSE / ParSEC; MAGMA 1.3
– Supported environments: Linux, Windows, Mac OS X; CUDA / OpenCL / MIC; C/C++, Fortran; Matlab, Python

MAGMA Functionality

– 80 hybrid algorithms have been developed (total of 320 routines)
– Every algorithm is in 4 precisions (s/c/d/z)
– There are 3 mixed-precision algorithms (zc & ds)
– These are hybrid algorithms, expressed in terms of BLAS
– MAGMA BLAS: a subset of GPU BLAS, optimized for Tesla and Fermi GPUs

MAGMA Methodology Overview

A methodology to use all available resources:
– MAGMA uses a hybridization methodology based on
  – representing linear algebra algorithms as collections of tasks and data dependencies among them
  – properly scheduling tasks' execution over multicore and GPU hardware components
– Successfully applied to fundamental linear algebra algorithms
  – one- and two-sided factorizations and solvers
  – iterative linear and eigensolvers
– Productivity
  – use a high-level description; the low level is hidden with proper abstractions
  – leverage prior efforts
  – exceed the performance of homogeneous solutions

[Figure: hybrid CPU + GPU algorithms (small tasks for multicores and large tasks for GPUs)]

Hybrid Algorithms

Use case: one-sided factorization
– LU, QR, Cholesky

Hybridization procedure
– Panels are factored on the CPU using LAPACK (or equivalent)
  – panel factorization is slow on the GPU, so it is off-loaded from the GPU to the CPU
– Trailing matrix updates are done on the GPU
– Look-ahead helps in hiding communication and the panel factorization

[Figure: matrix split into the current panel and the trailing matrix]

A Hybrid Algorithm Example

– Left-looking hybrid Cholesky factorization in MAGMA
– The difference with LAPACK: the 4 additional lines in red
– Line 8 (done on the CPU) is overlapped with work on the GPU (from line 6)
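The slide's listing is not reproduced in this transcription. The following is a minimal, illustrative sketch of one panel step of a left-looking hybrid Cholesky (lower triangular, double precision), assuming the matrix already resides on the GPU in column-major layout, a cuBLAS handle, two CUDA streams, and a host workspace of at least nb x nb doubles. It is not MAGMA's actual code and its line numbering does not match the slide's listing; it only shows how the small CPU dpotrf is overlapped with the large GPU GEMM. A full factorization would loop this step over j = 0, nb, 2*nb, ...

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* LAPACK's CPU Cholesky, used for the small diagonal block. */
extern void dpotrf_(const char *uplo, const int *n, double *A,
                    const int *lda, int *info);

/* One panel step of a left-looking hybrid Cholesky (lower triangular).
 * dA: n x n matrix on the GPU, column-major, leading dimension ldda.
 * hwork: host buffer of at least nb*nb doubles.
 * s_gpu: stream for BLAS work; s_copy: stream for CPU<->GPU copies.   */
static void chol_panel_step(cublasHandle_t handle,
                            cudaStream_t s_gpu, cudaStream_t s_copy,
                            double *dA, int ldda, double *hwork,
                            int n, int j, int nb, int *info)
{
    const double one = 1.0, m_one = -1.0;
    int jb = (nb < n - j) ? nb : n - j;

    cublasSetStream(handle, s_gpu);

    /* 1) Update the diagonal block with the already factored columns:
          A(j,j) -= L(j,0:j) * L(j,0:j)^T   (SYRK on the GPU)           */
    cublasDsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, jb, j,
                &m_one, dA + j, ldda, &one, dA + j + (size_t)j*ldda, ldda);
    cudaStreamSynchronize(s_gpu);

    /* 2) Start copying the updated diagonal block to the CPU.          */
    cudaMemcpy2DAsync(hwork, jb*sizeof(double),
                      dA + j + (size_t)j*ldda, (size_t)ldda*sizeof(double),
                      jb*sizeof(double), jb, cudaMemcpyDeviceToHost, s_copy);

    /* 3) Update the rest of the panel (GEMM on the GPU); this is the
          large piece of work that overlaps with the CPU factorization. */
    if (j + jb < n)
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n - j - jb, jb, j,
                    &m_one, dA + j + jb, ldda, dA + j, ldda,
                    &one, dA + j + jb + (size_t)j*ldda, ldda);

    /* 4) Factor the small diagonal block on the CPU while the GEMM is
          still running on the GPU (the overlap the slide refers to).   */
    cudaStreamSynchronize(s_copy);
    dpotrf_("L", &jb, hwork, &jb, info);

    /* 5) Send the factored block back and finish the panel with a TRSM. */
    cudaMemcpy2DAsync(dA + j + (size_t)j*ldda, (size_t)ldda*sizeof(double),
                      hwork, jb*sizeof(double),
                      jb*sizeof(double), jb, cudaMemcpyHostToDevice, s_copy);
    cudaStreamSynchronize(s_copy);
    if (j + jb < n)
        cublasDtrsm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, n - j - jb, jb,
                    &one, dA + j + (size_t)j*ldda, ldda,
                    dA + j + jb + (size_t)j*ldda, ldda);
}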

LU Factorization (single GPU)

From Single to Multi-GPU Support

Data distribution
– 1-D block-cyclic distribution

Algorithm
– The GPU holding the current panel sends it to the CPU
– All updates are done in parallel on the GPUs
– Look-ahead is done with the GPU holding the next panel

[Figure: block columns of width nb distributed cyclically over GPU0, GPU1, GPU2, GPU0, ...]
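As a concrete illustration of the 1-D block-cyclic distribution described above, here is a tiny hedged sketch (not a MAGMA API; the function name is made up for this example) that maps a global block-column (panel) index to the GPU that owns it:

#include <stdio.h>

/* 1-D block-cyclic ownership: panel k (a block column of width nb)
 * is stored on GPU (k mod ngpu).  Illustrative helper only.          */
static int panel_owner(int k, int ngpu) { return k % ngpu; }

int main(void)
{
    int ngpu = 3;
    for (int k = 0; k < 8; ++k)   /* panels 0..7 map to GPUs 0,1,2,0,1,2,0,1 */
        printf("panel %d lives on GPU %d\n", k, panel_owner(k, ngpu));
    return 0;
}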

LU Factorization: Multiple GPUs
– Matrix too large for a single GPU's memory

Out of GPU Memory Algorithms

– Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using existing algorithms)
– The rest of the matrix stays on the CPU
– Left-looking versions minimize writing on the CPU

[Figure: factored sub-matrices A1 on the CPU; to-be-factored sub-matrix A2 on the GPU; untouched part of A2]

1) Copy A2 to the GPU
2) Update A2 using A1 (the already factored part kept on the CPU)
3) Factor the updated A2 using existing hybrid code
4) Copy the factored A2 back, using existing algorithms
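Below is a hedged structural sketch of the scheme above. Every helper (copy_panel_to_gpu, left_looking_update, hybrid_factor_in_gpu_memory, copy_panel_to_cpu) is a hypothetical placeholder, not a MAGMA routine, and gpu_cols stands for however many columns fit in GPU memory.

/* Hypothetical helpers, declared only so the sketch compiles; they stand
 * in for host<->device copies and for the existing in-GPU-memory hybrid
 * factorization. */
void copy_panel_to_gpu(const double *A, int n, int j, int w);
void left_looking_update(double *A, int n, int j, int w);
void hybrid_factor_in_gpu_memory(double *A, int n, int j, int w);
void copy_panel_to_cpu(double *A, int n, int j, int w);

/* Factor an n x n matrix that does not fit in GPU memory,
 * gpu_cols columns at a time (left-looking over sub-matrices). */
void factor_out_of_gpu_memory(double *A, int n, int gpu_cols)
{
    for (int j = 0; j < n; j += gpu_cols) {
        int w = (gpu_cols < n - j) ? gpu_cols : n - j;

        copy_panel_to_gpu(A, n, j, w);           /* 1) bring A2 to the GPU              */
        left_looking_update(A, n, j, w);          /* 2) update A2 with the factored A1   */
        hybrid_factor_in_gpu_memory(A, n, j, w);  /* 3) factor A2 with existing hybrid code */
        copy_panel_to_cpu(A, n, j, w);            /* 4) only the factored A2 is written back;
                                                        this is why left-looking minimizes
                                                        writes to the CPU                 */
    }
}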

A New Generation of DLA Software

[Slide: timeline of dense linear algebra software packages]
– BLAS
– LAPACK (80's): blocking, cache friendly, Level-3 BLAS
– ScaLAPACK (90's): distributed memory, PBLAS, MPI
– PLASMA (mid 00's): multicore / manycore, DAG scheduler, tile data layout, extra kernels
– MAGMA (late 00's): accelerated multicore, hybrid scheduler, hybrid kernels
– MAGMA hybrid algorithms (heterogeneity friendly) rely on a hybrid scheduler and hybrid kernels

Hybrid Algorithms: One-Sided Transformations

One-sided factorizations
– LU, QR, and Cholesky

Hybridization
– Panels (Level 2 BLAS) are factored on the CPU using LAPACK
– Trailing matrix updates (Level 3 BLAS) are done on the GPU using "look-ahead"

Hybrid Algorithms: Two-Sided Transformations

Two-sided factorizations
– Bidiagonal (singular values)
– Tridiagonal (symmetric/generalized eigenvalues)
– Upper Hessenberg (non-symmetric eigenvalues)

Hybridization
– Trailing matrix updates (Level 3 BLAS) are done on the GPU
  – similar to the one-sided factorizations
– Panels (Level 2 BLAS) are hybrid
  – operations with memory footprint restricted to the panel are done on the CPU
  – the time-consuming matrix-vector products involving the entire trailing matrix are done on the GPU
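To make the last point concrete, here is a hedged sketch of the dominant panel operation of the tridiagonal reduction, a symmetric matrix-vector product with the trailing matrix, issued to the GPU through cuBLAS while the small panel-local work stays on the CPU. Buffer allocation and transfers are assumed to have been done by the caller; the function and argument names are this sketch's own, not MAGMA's.

#include <cublas_v2.h>

/* dy = A * dv for the n x n symmetric trailing matrix already resident
 * on the GPU (lower triangle stored, column-major, leading dim ldda).
 * dv and dy are device vectors of length n.  This memory-bound SYMV is
 * the roughly 50% of the flops that the slides say must run on the GPU. */
void trailing_symv_on_gpu(cublasHandle_t handle, int n,
                          const double *dA, int ldda,
                          const double *dv, double *dy)
{
    const double one = 1.0, zero = 0.0;
    cublasDsymv(handle, CUBLAS_FILL_MODE_LOWER, n,
                &one, dA, ldda, dv, 1, &zero, dy, 1);
    /* while this call executes, the CPU is free to do the small,
       panel-local Level 1/2 work (scalings, dot products on the panel) */
}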

Additional 4x Speedup from Faster GPU BLAS

[Performance chart: DSYTRD (symmetric tridiagonal reduction), Gflop/s vs. matrix size; legend: DSYTRD MKL 12 cores, DSYTRD 1 GPU, DSYTRD 2 GPUs, DSYTRD 3 GPUs, DSYTRD 2-stage 1 GPU]
– 12x speedup over 12 Intel cores (X5660 @ 2.8 GHz)
– The two-stage approach leads to increased computational intensity
– Keeneland system, using one node: 3 NVIDIA Fermi GPUs (M2070 @ 1.1 GHz, 5.4 GB), 2 x 6 Intel Xeon cores (X5660 @ 2.8 GHz, 23 GB)

A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, "A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine grained memory aware tasks," ICL Technical report, 03/2012.

Multi-GPU Two-Sided Factorizations

– Need HPC multi-GPU Level 2 BLAS (e.g., 50% of the flops in the tridiagonal reduction)

[Performance chart: DSYMV on M2090 GPUs, Gflop/s vs. matrix size (up to 25000); legend: CUBLAS, 1 GPU, 2 GPUs, 3 GPUs]

T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, "Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems," ICL Technical report, 03/2012.
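As a rough illustration of how a matrix-vector product can be spread over several GPUs, here is a hedged, deliberately naive sketch: it stores the full matrix, so a plain GEMV per GPU suffices and no reduction is needed, with block rows split across devices. The cited MAGMA work is more sophisticated (it operates on one triangle of the symmetric matrix and balances memory traffic), so treat this only as the shape of the idea; all names are local to the sketch.

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* y = A * x with the rows of the n x n matrix split across ngpu GPUs.
 * dA[g] holds rows [rows_beg[g], rows_beg[g] + rows_cnt[g]) of A
 * (column-major, leading dimension rows_cnt[g]); dx[g] holds a full copy
 * of x; dy[g] receives that GPU's chunk of y; handles[g] is a cuBLAS
 * handle created on device g.  Naive: symmetry is not exploited.        */
void multi_gpu_gemv(int ngpu, cublasHandle_t *handles, int n,
                    double **dA, double **dx, double **dy,
                    const int *rows_beg, const int *rows_cnt,
                    double *y_host)
{
    const double one = 1.0, zero = 0.0;

    for (int g = 0; g < ngpu; ++g) {          /* launch one GEMV per GPU        */
        cudaSetDevice(g);
        cublasDgemv(handles[g], CUBLAS_OP_N, rows_cnt[g], n,
                    &one, dA[g], rows_cnt[g], dx[g], 1, &zero, dy[g], 1);
    }
    for (int g = 0; g < ngpu; ++g) {          /* gather the disjoint chunks of y */
        cudaSetDevice(g);
        cudaMemcpy(y_host + rows_beg[g], dy[g],
                   rows_cnt[g] * sizeof(double), cudaMemcpyDeviceToHost);
    }
}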

Hybrid Two-Sided Factorizations

From Fast BLAS to Fast Tridiagonalization

Performance of MAGMA DSYTRD on multiple M2090 GPUs
– 50% of the flops are in SYMV
– Memory bound, i.e. does not scale well on multicore CPUs
– Use the GPU's high memory bandwidth and optimized SYMV
– 8x speedup over 12 Intel cores (X5660 @ 2.8 GHz)
– Keeneland system, using one node: 3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB), 2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)

T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, "Symmetric dense matrix-vector multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems," ICL Technical report, 03/2012.

From Static to Dynamic Scheduling

– Static scheduling may stall in situations where work is available
– Hand-tuned optimizations
– Hardware heterogeneity
– Kernel heterogeneity
– Separation of concerns
– Dynamic runtime system

Matrices Over Runtime Systems at Exascale (MORSE)

– Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems"
– Runtime challenges due to the ever growing hardware complexity
– Algorithmic challenges to exploit the hardware capabilities to the fullest
– Integrated into the MAGMA software stack

MAGMA-MORSE: x86 + Multiple GPUs

– Lessons learned from PLASMA
– New high-performance numerical kernels
– StarPU runtime system
  – Augonnet et al., INRIA Bordeaux
– Use of both x86 CPUs and GPUs leads to hybrid computations
– Similar to LAPACK in functionality

High Productivity: Sequential Code

From Sequential Nested-Loop Code to Parallel Execution

for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k][k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k][k], A[k][n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k][k], A[m][k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m][k], A[k][n], A[m][n], ...);
    }
}

High Productivity: Parallel Code

From Sequential Nested-Loop Code to Parallel Execution

for (k = 0; k < min(MT, NT); k++) {
    starPU_Insert_Task(&cl_zgeqrt, A, k, k, ...);
    for (n = k+1; n < NT; n++)
        starPU_Insert_Task(&cl_zunmqr, A, k, n, ...);
    for (m = k+1; m < MT; m++) {
        starPU_Insert_Task(&cl_ztsqrt, A, m, k, ...);
        for (n = k+1; n < NT; n++)
            starPU_Insert_Task(&cl_ztsmqr, A, m, n, k, ...);
    }
}

Contact Information and Generous Sponsors

Stan Tomov, tomov@eecs.utk.edu

MAGMA team: http://icl.cs.utk.edu/magma/
PLASMA team: http://icl.cs.utk.edu/plasma/

Collaborating partners
– University of Tennessee, Knoxville
– University of California, Berkeley
– University of Colorado, Denver
– INRIA, France (StarPU team)
– KAUST, Saudi Arabia
