Unified Memory For Data Analytics And Deep Learning - Nvidia

Transcription

UNIFIED MEMORY FOR DATA ANALYTICS AND DEEP LEARNING
Nikolay Sakharnykh, Chirayu Garg, and Dmitri Vainbrand, Thu Mar 19, 3:00 PM

RAPIDS: CUDA-accelerated Data Science Libraries; DASK; APACHE ARROW on GPU Memory

MORTGAGE PIPELINE (…er/mortgage/E2E.ipynb): CSV → read CSV → DF → filter → join → groupby → Arrow

MORTGAGE PIPELINE: PREP: DMatrix → XGBoost

GTC EU KEYNOTE RESULTS ON DGX-1: mortgage workflow time breakdown on DGX-1 (s), split into ETL, PREP, and ML stages (chart)

MAXIMUM MEMORY USAGE ON DGX-1: memory usage in GB per GPU (GPU ID 1–8) against the Tesla V100 limit of 32GB (chart)

ETL INPUT (mortgage data): original input set of 112 quarters (~2–3GB CSV each), split into 240 CSV files (~1GB each)

CAN WE AVOID INPUT SPLITTING? GPU memory usage (GB) during ETL: the 112-part input stays under the Tesla V100 limit, while the original (unsplit) dataset crashes with OOM (charts)

ML INPUT: some number of quarters are used for the ML step

CAN WE TRAIN ON MORE DATA? GPU memory usage (GB) during PREP when 20 of the 112 parts vs 28 of the 112 parts are used, against the Tesla V100 limit (charts)

HOW MEMORY IS MANAGED IN RAPIDS

RAPIDS MEMORY MANAGER (https://github.com/rapidsai/rmm)
RAPIDS Memory Manager (RMM) is:
- A replacement allocator for CUDA device memory
- A pool allocator to make CUDA device memory allocation faster & asynchronous
- A central place for all device memory allocations in cuDF and other RAPIDS libraries

WHY DO WE NEED MEMORY POOLS
cudaMalloc/cudaFree are synchronous and block the device:
    cudaMalloc(&buffer, size_in_bytes);
    cudaFree(buffer);
cudaMalloc/cudaFree are expensive:
- cudaFree must zero memory for security
- cudaMalloc creates peer mappings for all GPUs
Using the cnmem memory pool improves RAPIDS ETL time by 10x.
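To make the pool idea concrete, here is a minimal, hedged sketch (not RMM's or cnmem's actual implementation): one large cudaMalloc is reserved up front, and later requests are served by a bump pointer, so individual allocations involve no CUDA API calls and no device-wide synchronization. The SimplePool type, the pool size, and the 256-byte alignment are illustrative choices.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Trivial bump-pointer pool: one cudaMalloc up front, then sub-allocations
    // that never touch the CUDA runtime (and therefore never synchronize).
    struct SimplePool {
        char*  base   = nullptr;   // start of the reserved device region
        size_t size   = 0;         // total bytes reserved
        size_t offset = 0;         // next free byte

        cudaError_t init(size_t bytes) {
            size = bytes;
            return cudaMalloc(&base, bytes);                 // the only expensive call
        }
        void* alloc(size_t bytes) {
            size_t aligned = (bytes + 255) & ~size_t(255);   // 256-byte alignment
            if (offset + aligned > size) return nullptr;     // pool exhausted
            void* p = base + offset;
            offset += aligned;
            return p;                                        // no cudaMalloc, no sync
        }
        void release_all() { offset = 0; }                   // free everything at once
        void destroy()     { cudaFree(base); base = nullptr; }
    };

    int main() {
        SimplePool pool;
        if (pool.init(1u << 28) != cudaSuccess) { printf("init failed\n"); return 1; }
        void* a = pool.alloc(1u << 20);
        void* b = pool.alloc(1u << 20);
        printf("sub-allocated %p and %p from a single cudaMalloc\n", a, b);
        pool.release_all();
        pool.destroy();
        return 0;
    }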

RAPIDS MEMORY MANAGER (RMM): Fast, Asynchronous Device Memory Management

C/C++:
    RMM_ALLOC(&buffer, size_in_bytes, stream_id);
    RMM_FREE(buffer, stream_id);

Python (drop-in replacement for the Numba API):
    dev_ones = rmm.device_array(np.ones(count))
    dev_twos = rmm.device_array_like(dev_ones)
    # also rmm.to_device(), rmm.auto_device(), etc.

Thrust (device vector and execution policies):
    #include "rmm_thrust_allocator.h"
    rmm::device_vector<int> dvec(size);
    thrust::sort(rmm::exec_policy(stream)->on(stream), ...);

MANAGING MEMORY IN THE E2E PIPELINE (diagram with callouts): "perf optimization"; "at this point all ETL processing is done and memory is stored in Arrow"; "required to avoid OOM"

KEY MEMORY MANAGEMENT QUESTIONS
- Can we make memory management easier?
- Can we avoid artificial pre-processing of input data?
- Can we train on larger datasets?

SOLUTION: UNIFIED MEMORY (diagram): empty GPU memory, fully occupied GPU memory, partially occupied GPU memory; pages resident on the GPU vs pages spilled to CPU memory when the GPU is oversubscribed
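A hedged sketch of what oversubscription looks like at the CUDA level, assuming a Pascal-or-newer GPU and enough system memory: a single cudaMallocManaged allocation larger than the GPU's physical memory succeeds, and the driver migrates and evicts pages as the kernel touches them. The 48 GiB size and the fill kernel are illustrative, not from the slides.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void fill(float* data, size_t n) {
        size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (; i < n; i += stride) data[i] = 1.0f;   // touching pages triggers migration
    }

    int main() {
        size_t bytes = 48ull << 30;   // deliberately larger than a 32GB V100
        float* data = nullptr;
        // With cudaMalloc an allocation this large simply fails (OOM); with
        // managed memory it succeeds and spills to CPU memory as needed.
        if (cudaMallocManaged(&data, bytes) != cudaSuccess) {
            printf("allocation failed\n");
            return 1;
        }
        size_t n = bytes / sizeof(float);
        fill<<<1024, 256>>>(data, n);
        cudaDeviceSynchronize();
        printf("touched %zu floats through Unified Memory\n", n);
        cudaFree(data);
        return 0;
    }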

HOW TO USE UNIFIED MEMORY IN CUDF

Python:
    from librmm_cffi import librmm_config as rmm_cfg
    rmm_cfg.use_pool_allocator = True    # default is False
    rmm_cfg.use_managed_memory = True    # default is False

IMPLEMENTATION DETAILS

Regular RMM allocation:
    if (rmm::Manager::usePoolAllocator()) {
        RMM_CHECK(…(stream));
        RMM_CHECK_CNMEM(cnmemMalloc(reinterpret_cast<void**>(ptr), size, stream));
    }
    else if (rmm::Manager::useManagedMemory())
        RMM_CHECK_CUDA(cudaMallocManaged(reinterpret_cast<void**>(ptr), size));
    else
        RMM_CHECK_CUDA(cudaMalloc(reinterpret_cast<void**>(ptr), size));

Pool allocator (CNMEM):
    if (mFlags & CNMEM_FLAGS_MANAGED) {
        CNMEM_DEBUG_INFO("cudaMallocManaged(%lu)\n", size);
        CNMEM_CHECK_CUDA(cudaMallocManaged(&data, size));
        CNMEM_CHECK_CUDA(cudaMemPrefetchAsync(data, size, mDevice));
    }
    else {
        CNMEM_DEBUG_INFO("cudaMalloc(%lu)\n", size);
        CNMEM_CHECK_CUDA(cudaMalloc(&data, size));
    }

1. UNSPLIT DATASET "JUST WORKS": GPU memory usage (GB) during ETL on the original dataset; with cudaMalloc the run hits the Tesla V100 limit (32GB) and crashes with OOM, with cudaMallocManaged the pool grows past 32GB (memory used vs pool size) and the run completes (charts)

2. SPEED-UP ON CONVERSION: DGX-1 time (s) for ETL, PREP, and ML on 20 quarters, cudaMalloc vs cudaMallocManaged; 25% speed-up on PREP! (chart)

3. LARGER ML TRAINING SET: DGX-1 time (s) for ETL, PREP, and ML on 20 vs 28 quarters; with cudaMalloc the 28-quarter run fails with OOM, with cudaMallocManaged it completes (chart)

UNIFIED MEMORY GOTCHAS
1. UVM doesn't work with CUDA IPC, so be careful when sharing data between processes.
   Workaround: keep a separate (small) cudaMalloc pool for communication buffers (see the sketch below).
   In the future this will work transparently with Linux HMM.
2. Yes, you can oversubscribe, but there is a danger that it will just run very slowly.
   Capture Nsight or nvprof profiles to check eviction traffic.
   In the future RMM may show warnings about this.
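A hedged sketch of the workaround from gotcha 1, using only standard CUDA runtime calls: the bulk data stays in managed memory, while a small cudaMalloc'd staging buffer is what gets exported through CUDA IPC. Buffer sizes and names are illustrative.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        size_t bulk_bytes  = 8ull << 30;    // large, possibly oversubscribed data
        size_t stage_bytes = 64ull << 20;   // small communication buffer

        float* bulk = nullptr;
        cudaMallocManaged(&bulk, bulk_bytes);   // UVM: cannot be exported via CUDA IPC

        void* stage = nullptr;
        cudaMalloc(&stage, stage_bytes);        // ordinary device memory: IPC-shareable

        // Copy just the region to be shared into the staging buffer, then
        // export a handle to that buffer for the peer process.
        cudaMemcpy(stage, bulk, stage_bytes, cudaMemcpyDefault);

        cudaIpcMemHandle_t handle;
        if (cudaIpcGetMemHandle(&handle, stage) == cudaSuccess) {
            printf("IPC handle created for the cudaMalloc staging buffer\n");
            // the peer process would call cudaIpcOpenMemHandle() on this handle
        } else {
            printf("cudaIpcGetMemHandle failed\n");
        }

        cudaFree(stage);
        cudaFree(bulk);
        return 0;
    }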

RECAP
Just to run the full pipeline on the GPU you need to:
- carefully partition input data
- adjust memory pool options throughout the pipeline
- limit training size to fit in memory
Unified Memory:
- makes life easier for data scientists: less tweaking!
- improves performance: sometimes it's faster to allocate less often & oversubscribe
- enables easy experiments with larger datasets

MEMORY MANAGEMENT IN THE NEXT BIG THING
Contribute to RAPIDS: https://github.com/rapidsai/cudf
Contribute to RMM: https://github.com/rapidsai/rmm

UNIFIED MEMORY FOR DEEP LEARNING

FROM ANALYTICS TO DEEP LEARNING: Data Preparation → Machine Learning → Deep Learning

PYTORCH INTEGRATION
PyTorch uses a caching allocator to manage GPU memory:
- Small allocations are served from a fixed buffer (for example, 1 MB)
- Large allocations are dedicated cudaMalloc's
Trivial change (sketched below):
- Replace cudaMalloc with cudaMallocManaged
- Immediately call cudaMemPrefetchAsync to allocate the pages on the GPU; otherwise cuDNN may select sub-optimal kernels
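A hedged sketch of that change, written as a generic allocation hook rather than PyTorch's actual caching-allocator source: cudaMalloc is swapped for cudaMallocManaged, followed immediately by cudaMemPrefetchAsync so the pages are resident on the GPU before cuDNN selects its kernels.

    #include <cuda_runtime.h>

    // Allocation path of a caching allocator, modified as described above.
    // Hypothetical helper; the real allocator has more bookkeeping around it.
    cudaError_t managed_alloc(void** ptr, size_t size, cudaStream_t stream) {
        cudaError_t err = cudaMallocManaged(ptr, size);   // was: cudaMalloc(ptr, size)
        if (err != cudaSuccess) return err;

        int device = 0;
        cudaGetDevice(&device);
        // Populate the pages on the current GPU right away; if the first touch
        // happens lazily inside library kernels, cuDNN may pick sub-optimal
        // algorithms.
        return cudaMemPrefetchAsync(*ptr, size, device, stream);
    }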

PYTORCH ALLOCATOR VS RMM
PyTorch Caching Allocator:
- Memory pool to avoid synchronization on malloc/free
- Directly uses CUDA APIs for memory allocations
- Pool size is not fixed
- Specific to the PyTorch C++ library
RMM:
- Memory pool to avoid synchronization on malloc/free
- Uses cnmem for memory allocation and management
- Reserves half the available GPU memory for the pool
- Re-usable across projects, with interfaces for various languages

WORKLOADS: Image Models: ResNet-1001, DenseNet-264, VNet (building block shown: BN-ReLU-Conv 1x1 → BN-ReLU-Conv 3x3 → BN-ReLU-Conv 1x1)

WORKLOADS: Language Models: word language modelling; dictionary size 33278, embedding size 256, LSTM units 256, back-propagation through time 1408 and 2800; layers: Embedding → LSTM → FC → Softmax → Loss

WORKLOADS: Baseline training performance on V100-32GB: table of FP16 and FP32 batch size and samples/sec for each model, with the optimal batch size selected for high throughput. All results in this presentation use PyTorch 1.0rc1, the R418 driver, and Tesla V100-32GB.

GPU OVERSUBSCRIPTION: up to 3x the optimal batch size; DenseNet-264 samples/sec vs batch size for FP16 and FP32 (chart)

GPU OVERSUBSCRIPTION: Fill (diagram: pages fill up GPU memory, with CPU memory alongside)

GPU OVERSUBSCRIPTION: Evict (diagram: pages are evicted from GPU memory to CPU memory)

GPU OVERSUBSCRIPTION: Page Fault - Evict - Fetch (diagram: a page fault triggers eviction to CPU memory and a fetch of the faulting page onto the GPU)

GPU OVERSUBSCRIPTION: Results: table of FP16 and FP32 batch size and samples/sec for each model at oversubscribed batch sizes

GPU OVERSUBSCRIPTION: Page faults during a ResNet-1001 training iteration: page fault count vs oversubscription ratio (batch size / optimal batch size, 1 to 3) (chart)

GPU OVERSUBSCRIPTION: Manual API Prefetch
Add cudaMemPrefetchAsync before kernels are called:
    cudaMemPrefetchAsync( … )        // input, output, wparam
    cudnnConvolutionForward( … )
    cudaMemPrefetchAsync( … )        // A, B, C
    kernelPointWiseApply3( … )
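A hedged, self-contained sketch of the same pattern with a stand-in pointwise kernel instead of cudnnConvolutionForward / kernelPointWiseApply3: each operand buffer is prefetched on the kernel's stream right before the launch, so an oversubscribed run does not stall on page faults inside the kernel.

    #include <cuda_runtime.h>

    __global__ void pointwise_add(const float* a, const float* b, float* c, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // Prefetch the kernel's operands (A, B, C) to the GPU on the same stream,
    // then launch; analogous to the slide's prefetch before the cuDNN and
    // pointwise kernels.
    void launch_with_prefetch(const float* a, const float* b, float* c,
                              size_t n, cudaStream_t stream) {
        int device = 0;
        cudaGetDevice(&device);
        size_t bytes = n * sizeof(float);

        cudaMemPrefetchAsync(a, bytes, device, stream);
        cudaMemPrefetchAsync(b, bytes, device, stream);
        cudaMemPrefetchAsync(c, bytes, device, stream);

        int threads = 256;
        int blocks  = static_cast<int>((n + threads - 1) / threads);
        pointwise_add<<<blocks, threads, 0, stream>>>(a, b, c, n);
    }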

GPU OVERSUBSCRIPTION: No Prefetch vs Manual API Prefetch (charts)

GPU OVERSUBSCRIPTION: Speed-up from Manual API Prefetch (no prefetch vs prefetch) for ResNet-1001, DenseNet-264, VNet, Lang Model-1408, and Lang Model-2800, in FP16 and FP32; observe up to a 1.6x speed-up (chart)

GPU OVERSUBSCRIPTION: Prefetch Only When Needed
Prefetching memory before a kernel improves performance, but cudaMemPrefetchAsync takes CPU cycles and degrades performance when it is not required, so automatic prefetching is needed to achieve high performance (see the sketch below). Chart: ResNet-1001 samples/sec vs batch size, FP16 with and without prefetch.
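A hedged sketch of one possible "only when needed" policy; this heuristic is an assumption, not something shown on the slide: issue the prefetch only when the working set actually exceeds GPU memory, since at small batch sizes the cudaMemPrefetchAsync calls just burn CPU cycles.

    #include <cuda_runtime.h>

    // Heuristic: prefetch only if the model's working set is larger than the
    // GPU's total memory, i.e. only when the GPU is actually oversubscribed.
    bool should_prefetch(size_t working_set_bytes) {
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        return working_set_bytes > total_bytes;
    }

    void maybe_prefetch(const void* ptr, size_t bytes,
                        size_t working_set_bytes, cudaStream_t stream) {
        if (!should_prefetch(working_set_bytes)) return;  // small batch: skip the CPU cost
        int device = 0;
        cudaGetDevice(&device);
        cudaMemPrefetchAsync(ptr, bytes, device, stream);
    }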

DRIVER PREFETCH
Aggressive driver prefetching: driver-initiated (density-based) prefetching from CPU to GPU. A GPU page is tracked as a chunk of smaller sysmem pages, and the driver logic prefetches the rest of the GPU page once 51% of it has migrated to the GPU. Changing this threshold to 5% gives up to a 20% gain in performance vs the default settings.

FRAMEWORK FUTURE
Frameworks can develop intelligence to insert prefetches before calling GPU kernels:
- Smart evictions: evict activations only
- Lazy prefetch: catch kernel calls right before execution and add prefetch calls, e.g. replace nn.Conv2d( ) with nn.Prefetch( ) followed by nn.Conv2d( ) via a hook
- Eager prefetch: identify and add prefetch calls before the kernels are called, so prefetches execute in parallel with earlier work

TAKEAWAY
Unified Memory oversubscription solves the memory pool fragmentation issue. It is a simple way to train bigger models and on larger input data, with minimal user effort and no change in framework programming. Frameworks can get better performance by adding prefetches. Try it out: https://github.com/rapidsai/rmm
