DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access

Pak Markthub (Tokyo Institute of Technology), Mehmet E. Belviranli (Oak Ridge National Laboratory), Seyong Lee (Oak Ridge National Laboratory), Jeffrey S. Vetter (Oak Ridge National Laboratory), and Satoshi Matsuoka (RIKEN Center for Computational Science / Tokyo Institute of Technology)

In a Nutshell

- GPUs are widely used in HPC and ML, but workload sizes and user productivity have been limited by GPU memory capacity.
- Meanwhile, memory systems are evolving: NVMs provide larger capacities at lower cost and power compared to host and GPU memory.
- In this study, we propose DRAGON, which enables GPU kernels to directly access NVMs and transparently provides massive memory to GPUs.
- DRAGON is open source: a CUDA driver extension plus a user-level API, available at https://github.com/pakmarkthub/dragon

Motivation

Memory Systems Are Diversifying

- Architectures: HMC, HBM/HBM2/HBM3, LPDDR4, GDDR5X, WideIO2, etc.; 2.5D and 3D stacking
- Configurations: unified memory, scratchpads; write-through vs. write-back; consistency and coherence protocols; virtual vs. physical addressing, paging strategies
- New devices: ReRAM, PCRAM, STT-MRAM, 3D XPoint

J.S. Vetter and S. Mittal, "Opportunities for Nonvolatile Memory Systems in Extreme-Scale High Performance Computing," CiSE, 17(2):73-82, 2015.
H.S.P. Wong, H.Y. Lee, S. Yu et al., "Metal-oxide RRAM," Proceedings of the IEEE, 100(6):1951-70, 2012.

NVM Is Moving Up in the Memory Hierarchy

- Non-volatile, large capacity, low cost, low power, high speed (image source: IMEC)

GPUs Are Widely Adopted

- Employed in supercomputers, clusters, and clouds
  - TSUBAME3.0 @ TITECH: 540 nodes; 4 P100 GPUs per node
  - Summit @ ORNL: 4,608 nodes; 6 V100 GPUs per node
- Used by thousands of applications

Problem: GPU Memory Is Too Small!

- Problem sizes have grown larger than GPU memory, and even larger than host memory.
- To support large problem sizes, GPU algorithms become complex (out-of-core processing).
- "... as the model grows in size, the size of a SGD batch must be decreased (to fit in the GPU memory)" [A. Vedaldi et al., ACM MM 2016]

Comparison of approaches (*important assumption: input and output data are on files):

- On-chip, standard (cudaMalloc): low complexity; several man-hours of development; understandable for most GPU programmers; broadly applicable technique; but the problem size must fit in GPU memory. Data movement: copy input from files to GPU memory, execute, and copy output to files.
- Off-chip prototype (host memory as staging): still simple and broadly applicable, but performance is low (PCI-e I/O) and the problem size must fit in system memory. Data movement: manual buffering via I/O and CUDA calls.
- Off-chip production (for Big Data): unlimited problem size and high performance, but high complexity (I/O interfaces, pipelines, overlapping), roughly 100 man-hours of development, understandable only for highly trained programmers, and algorithm-specific. Data movement: manual buffer management, overlapping computation with data transfer.
- Standard (Unified Memory): on-demand and implicit data copy with HW paging, but performance is lower (PCI-e) and the problem size must fit in system memory.
- DRAGON: NVM is non-volatile, large capacity, and high I/O bandwidth; data moves host <-> GPU on demand, the GPU kernel has direct access, and the problem size is unlimited.

Can we use NVMs for GPUs without incurring application complexity while maintaining reasonable performance?

Related Work

- Out-of-core processing for GPUs: many research papers and implementations; efficient but algorithm-specific
- NVIDIA's Unified Memory (UM): on-demand paging for GPUs; cannot go beyond host memory due to page pinning; separate from the storage space, so fread/fwrite is still needed
- Other hardware- and software-based approaches (e.g., GPUfs, ActivePointers): prior work is mostly obsoleted by UM; many suggested impractical hardware

Solution & Contributions

DRAGON: Direct Resource Access for GPUs Over NVM
- Enables "mmap" from NVM into the GPU memory space
- Unifies the GPU memory, host memory, and NVM spaces with strong and weak coherency
- Utilizes GPU hardware page faulting via our customized nvidia-uvm driver
- Overlaps GPU computation with data transfer by nature (2.3x speedup over UM + I/O)

Key contributions:
- Transparently extends the GPU memory space to NVM devices, leading to virtually unlimited GPU memory capacity with no need to modify GPU kernels
- Eliminates manual buffer management for data larger than system memory
- Presents NVM-optimized data access pattern types to decrease I/O overhead
- Evaluates functionality and performance on scientific and ML applications

DRAGON – a new solution providing transparent access to massive NVM capacity

DRAGON: Overview

(Architecture figure: the application links against the libdragon library; the driver layer — vfs, page cache, nvme — connects the GPU and the NVM.)

- Extends nvidia-uvm to support GPU load/store from/to NVM
- Activated via a libdragon call: dragon_map()
- Fully compatible with the original CUDA API

DRAGON Operations: Key Components

- Three memory spaces: GPU memory (GM) as the first-level cache; host memory (HM) as the second-level cache; NVM as primary storage
- Modified GPU driver: manages data movement and coherency
- GPU MMU with hardware page faulting: manages GPU virtual memory mappings
- Page cache: buffers and accelerates data access

How Data Moves: NVM -> GPU

- A GPU load operation triggers a GPU page fault.
- The DRAGON driver extension transfers data NVM -> HM -> GM.
- The next data is prefetched into HM via page-cache read-ahead; prefetching occurs while the GPU is computing.
- If the cache hits in HM, the transfer becomes HM -> GM, with adaptive granularity (4 KiB, 64 KiB, or 2 MiB).
- If the cache hits in GM, no data transfer occurs.

How Data Moves: GPU -> NVM

- A GPU store operation writes data in GPU global memory.
- If there is no free GPU chunk, a victim chunk is evicted GM -> HM. The eviction policy follows UM; V100 does better than P100 thanks to its hardware access counters.
- While the GPU is computing, data flows HM -> NVM via page-cache write-back. The eviction policy follows the page cache.
- Data is kept on the fastest possible memory.

Optimizations

- Intermediate data (temporary workspace), e.g.: IN = read(...); tmp = compute0(IN); OUT = compute1(tmp); write(OUT, ...)
  - Keep the data on the fastest possible memory (GM -> HM -> NVM)
  - Disable automatic write-back and flushing
- Read-only and write-only data: one-directional data movement; no write-out and no read-in, respectively
- Eviction policies: intermediate data uses Least Recently Used (LRU); input/output data follows UM and the page cache
- Two-level prefetching: host to GPU at 4 KiB, 64 KiB, or 2 MiB; NVM to host follows page-cache read-ahead

Out-of-Core Using CUDA vs. DRAGON: API and Integration

CUDA (manual out-of-core):

    // Allocate host & device memory
    h_buf = malloc(size);
    cudaMalloc(&g_buf, size);
    while (...) {  // go over all chunks
        // Read in data
        f = fopen(filepath, "r");
        fread(h_buf, size, 1, f);
        // H2D transfer
        cudaMemcpy(g_buf, h_buf, size, cudaMemcpyHostToDevice);
        // GPU compute
        compute_on_gpu(g_buf);
        // Transfer back to host
        cudaMemcpy(h_buf, g_buf, size, cudaMemcpyDeviceToHost);
        compute_on_host(h_buf);
        // Write out result
        fwrite(h_buf, size, 1, f);
    }

DRAGON:

    // mmap data to host and GPU
    dragon_map(filepath, size, D_READ | D_WRITE, &g_buf);
    // Accessible on both host and GPU
    compute_on_gpu(g_buf);
    compute_on_host(g_buf);
    // Implicitly called when the program exits
    dragon_sync(g_buf);
    dragon_unmap(g_buf);

Notes:
- Similar to NVIDIA's Unified Memory (UM)
- Enables access to large memory on NVM; UM is limited by host memory

Evaluation on Scientific and ML Workloads

Evaluation

Data movement methods:
1. cudaMemcpy + fread/fwrite (original)
2. cudaHostRegister + mmap
3. UM-P + fread/fwrite (baseline)
4. DRAGON

Benchmark applications: (table omitted)

Environment:
- CPU: dual 12-core Intel Xeon E5
- Memory: 64 GiB DDR4
- GPU: one NVIDIA P100 (12 GiB), PCIe
- NVM: 2.4 TB Micron 9100 HHHL U.2 PCIe NVMe, formatted with ext4
- Connection: PCIe gen3 x16
- OS: CentOS 7, kernel 3.10.0-693.5.2.el7.x86_64
- CUDA: 9.0 with driver 384.81

No changes to the CUDA kernels; we measured the entire execution time, including reading input from and writing results to files.

Result: pathfinder (dynamic programming)

- Results normalized w.r.t. UM-P (#1: Default, #2: Hostreg, #3: UM-P, #4: DRAGON)
- DRAGON achieves a 2.3x speedup; performance follows a linear extrapolation thanks to the data streaming effect
- DRAGON is the only solution for out-of-core with user-oblivious addressing
- With the other methods, the GPU is idle during fread() and cudaMemcpy(); DRAGON overlaps GPU computation with data transfer
- The CUDA kernel is exactly the same!

Case Study: Caffe

Data movement methods:
1. cudaMemcpy + fread/fwrite (default)
2. UM-P + fread/fwrite
3. CPU, 4 threads (ATLAS)
4. CPU, all cores (OpenBLAS + OpenMP)
5. DRAGON (our solution)

Environment:
- CPU: dual 12-core Intel Xeon E5
- Memory: 24 GiB DDR4
- GPU: one NVIDIA P100 (12 GiB), PCIe
- NVM: 2.4 TB Micron 9100 HHHL U.2 PCIe NVMe, formatted with ext4
- Connection: PCIe gen3 x16
- OS: CentOS 7, kernel 3.10.0-693.5.2.el7.x86_64
- CUDA: 9.0 with driver 384.81

Methodology:
- Modified Caffe to support UM-P and DRAGON; changed only the data movement methods on the host, with no changes to CUDA kernels
- Varied the memory footprint by varying the problem size and input parameters
- Measured execution time, normalized w.r.t. DRAGON

Case Study: C3D-UCF101Net (Caffe)
(video action recognition; 3D convolution; batch size 1; ran 30 batches)

- Enables complex algorithms that require large data chunks to process
- Out-of-core (footprints beyond GPU and host memory): DRAGON is 3.5x faster than using CPUs
- In-core (footprint fits in GPU memory): DRAGON does not lose to original Caffe
- Results normalized w.r.t. DRAGON

Conclusion

DRAGON efficiently and transparently maps files on NVMs to GPUs:
- No need to modify CUDA kernels
- Enables transparent off-chip memory access for large data, with no manual buffering and no changes to GPU algorithms
- Significantly improves application execution times through direct page-cache access and intermediate-data handling optimizations
- Implicitly overlaps computation with data transfer via read-ahead and write-back

Available at https://github.com/pakmarkthub/dragon

Acknowledgement

This research was supported in part by an appointment to the Oak Ridge National Laboratory ASTRO Program, sponsored by the U.S. Department of Energy and administered by the Oak Ridge Institute for Science and Education. This work was also partially supported by JST CREST Grant Numbers JPMJCR1303 (EBD CREST) and JPMJCR1687 (DEEP CREST), and performed under the auspices of the Real-World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), Japan.

Architecture (summary)

- Enabled GPUs and CPUs to access the same file-backed mapped virtual addresses
- Supports UM data consistency down to NVM
- Fully compatible with UM, without performance penalty
- Data accesses are naturally streamed by multi-level prefetching; even simple GPU algorithms benefit from overlapping computation with data transfer