Deep Learning/AI Lifecycle With Dell EMC And Bitfusion - TAMU

Transcription

Deep Learning/AI Lifecycle with Dell EMC and Bitfusion
Bhavesh Patel, Dell EMC Server Advanced Engineering

Abstract
This talk gives an overview of the end-to-end application life cycle of deep learning in the enterprise along with numerous use cases, and summarizes studies done by Bitfusion and Dell on a high-performance, heterogeneous, elastic rack of Dell EMC PowerEdge C4130s with Nvidia GPUs. Use cases discussed in detail include the ability to bring on-demand GPU acceleration beyond the rack and across the enterprise with easily attachable elastic GPUs for deep learning development, as well as the creation of a cost-effective, software-defined, high-performance elastic multi-GPU system that combines multiple Dell EMC C4130 servers at runtime for deep learning training.

Deep Learning and AI are being adopted across a wide range of market segments

Industry/function and example AI applications:
- Industries: Pharma, Healthcare, Energy, Education, Sales, Supply Chain, Customer Service, Maintenance
- Applications: Computer Vision & Speech, Drones, Droids; Interactive Virtual & Mixed Reality; Self-Driving Cars, Co-Pilot Advisor; Predictive Price Analysis, Dynamic Decision Support; Drug Discovery, Protein Simulation; Predictive Diagnosis, Wearable Intelligence; Geo-Seismic Resource Discovery; Adaptive Learning Courses; Adaptive Product Recommendations; Dynamic Routing Optimization; Bots and Fully-Automated Service; Dynamic Risk Mitigation and Yield Optimization

...but few people have the time, knowledge, or resources to even get started

PROBLEM 1: HARDWARE INFRASTRUCTURE LIMITATIONS
- Increased cost with dense servers
- TOR bottleneck, limited scalability
- Limited multi-tenancy on GPU servers (limited CPU and memory per user)
- Limited to 8-GPU applications
- Does not support GPU apps with high storage, CPU, or memory requirements

PROBLEM 2: SOFTWARE COMPLEXITY OVERLOAD
- Software Management: GPU driver management; framework and library installation; deep learning framework configuration; package manager; Jupyter server or IDE setup
- Infrastructure Management: cloud or server orchestration; GPU hardware setup; GPU resource allocation; container orchestration; networking direct bypass (MPI / RDMA / RPI / gRPC); monitoring
- Model Management: code version management; hyperparameter optimization; experiment tracking; deployment automation; deployment continuous integration
- Data Management: data uploader; shared local file system; data volume management; data integrations and pipelining
- Workload Management: job scheduler; log management; user and group management; inference autoscaling

Need to Simplify and Scale

SOLUTION 1/2: CONVERGED RACK SOLUTION
- Up to 64 GPUs per application
- GPU applications with varied storage, memory, and CPU requirements
- 30-50% less cost per GPU
- More {cores, memory} per GPU
- Higher intra-rack networking bandwidth
- Less inter-rack load
- Composable: add-as-you-go composable compute bundles

SOLUTION 2/2: COMPLETE, STREAMLINED AI DEVELOPMENT
Develop: Develop on pre-installed, quick-start deep learning containers. Get to work quickly with workspaces with optimized, preconfigured drivers, frameworks, libraries, and notebooks. Start with CPUs, and attach elastic GPUs on demand. All code and data is saved automatically and is sharable.
Train: Transition from development to training with multiple GPUs. Seamlessly scale out to more GPUs on a shared training cluster to train larger models quickly and cost-effectively. Support and manage multiple users, teams, and projects. Train multiple models in parallel for massive productivity.
Deploy: Push trained, finalized models into production. Deploy a trained neural network into production and perform real-time inference across different hardware. Manage multiple AI applications and inference endpoints corresponding to different trained models.

Dell EMC Deep Learning Optimized
- Accelerator: GPU; NvLink-GPU; KNL Phi in C6320P sled
- Compute platform: C4130; R730; C6320P in C6300

C4130 DEEP LEARNING SERVER
[Server photos, front and back: power supplies, CPU sockets (under heatsinks), 8 fans, iDRAC NIC, dual SSD boot drives, 2x 1Gb NIC, 4 GPU accelerators, optional redundant power supplies]

GPU DEEP LEARNING RACK SOLUTION
- Pre-built app containers
- GPU and workspace management
- Elastic GPUs across the datacenter
- Software-defined scaled-out GPU servers

Configuration details:
Feature | R730 | C4130
CPU | E5-2669 v3 @ 2.1 GHz | E5-2630 v3 @ 2.4 GHz
Memory | 4GB | 1TB/node; 64GB DIMMs
Storage | Intel PCIe NVMe | Intel PCIe NVMe
Networking I/O | CX3 FDR InfiniBand | CX3 FDR InfiniBand
GPU | NA | M40-24GB
TOR switch: Mellanox SX6036 FDR switch
Cables: FDR 56G DCA cables

GPU DEEP LEARNING RACK SOLUTION: End-to-End Deep Learning Application Life Cycle
- Pre-built app containers
- GPU and workspace management
- Elastic GPUs across the datacenter
- Software-defined scaled-out GPU servers
CPU nodes: R730 #1, R730 #2. GPU nodes: C4130 #1 through #4, connected over an InfiniBand switch.
Workflow: 1. Develop, 2. Train, 3. Deploy

...but wait, 'converged compute' requires network-attached GPUs (R730, C4130).

BITFUSION CORE VIRTUALIZATION
GPU device virtualization allows dynamic GPU attach on a per-application basis.
Features:
- APIs: CUDA, OpenCL
- Distribution: scale out to remote GPUs
- Pooling: oversubscribe GPUs
- Resource provisioning: fractional vGPUs
- High availability: automatic DMR
- Manageability: remote nvidia-smi
- Distributed CUDA Unified Memory
- Native support for IB, GPUDirect RDMA
- Feature complete with CUDA 8.0
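The Bitfusion layer intercepts CUDA/OpenCL calls, so fractional and remote GPUs require no application changes. As a rough analogy only (this is not Bitfusion's mechanism), the sketch below caps one process at a quarter of a local GPU's memory using TensorFlow 1.x's documented gpu_options; a fractional vGPU gives a similar share of a device, but enforced transparently by the virtualization layer rather than by application-side configuration.

```python
# Analogy for "fractional GPU" sharing: restrict this process to ~25% of the
# device's memory via TensorFlow 1.x options. Bitfusion enforces fractional
# vGPUs at the CUDA API layer instead, with no changes to the application.
import tensorflow as tf  # TensorFlow 1.x style API

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.25)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))
```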

PUTTING IT ALL TOGETHER
Bitfusion Flex manages the application containers. The client server runs the Bitfusion client library; each GPU server runs the Bitfusion service daemon, and the client library forwards GPU work to those daemons over the network.

NATIVE VS. REMOTE GPUs
Completely transparent: all CUDA apps see local and remote GPUs as if directly connected.
[Diagram: GPU 0 and GPU 1 attached to the CPU over PCIe (native) vs. GPU 0 and GPU 1 reached through an InfiniBand HCA (remote)]
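A minimal illustration of that transparency, assuming the PyCUDA bindings are available on the client: the enumeration code below is identical whether the listed devices are local PCIe GPUs or GPUs attached over the network by the virtualization layer.

```python
# List every CUDA device visible to this process. Under GPU virtualization,
# remote GPUs show up here exactly like locally attached ones, so the
# application code does not change.
import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print(f"GPU {i}: {dev.name()}, "
          f"{dev.total_memory() // (1024 ** 2)} MiB")
```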

Results

REMOTE GPUs - LATENCY AND BANDWIDTH
- Data movement overhead is the primary scaling limiter
- Measurements done at the application level (cudaMemcpy); a sketch of such a measurement follows below
- Fast local GPU copies
- Intra-node copies via PCIe
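The slide reports application-level cudaMemcpy measurements but not the harness itself; a simple host-to-device bandwidth probe at that level could look like the sketch below (PyCUDA assumed; the 256 MiB buffer and 20 iterations are arbitrary choices).

```python
# Rough host-to-device bandwidth measurement at the cudaMemcpy level.
# Pinned (page-locked) host memory is used so the copy is DMA-driven.
import time
import numpy as np
import pycuda.autoinit  # creates a context on the default device
import pycuda.driver as cuda

nbytes = 256 * 1024 * 1024                       # 256 MiB per transfer
host = cuda.pagelocked_empty(nbytes, dtype=np.uint8)
dev = cuda.mem_alloc(nbytes)

iters = 20
start = time.perf_counter()
for _ in range(iters):
    cuda.memcpy_htod(dev, host)                  # synchronous copy
elapsed = time.perf_counter() - start

print(f"H2D bandwidth: {nbytes * iters / elapsed / 1e9:.2f} GB/s")
```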

16 GPU virtual system: Naive implementation with TCP/IP
- Four C4130 nodes (node 0 through node 3)
- Fast local GPU copies; intra-node copies via PCIe
- Low-bandwidth, high-latency remote copies
- OS bypass is needed to avoid the primary TCP/IP overheads
- AI apps are very latency sensitive

16 GPU virtual system: Bitfusion optimized transport and runtime
- Same FDR x4 transport, but drop IPoIB
- Replace remote calls with native IB verbs
- Runtime selection of intra-node RDMA vs. cudaMemcpy
- Multi-rail communications where available
- Remote GPUs reach near-native local GPU performance, with minimal NUMA effects
- Runtime optimizations: pipelining, speculative execution, distributed caching, and event coalescing (the pipelining idea is sketched below)
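Bitfusion's runtime is proprietary, but the pipelining it mentions is the standard technique of overlapping transfers with other queued work; a generic sketch using PyCUDA streams and asynchronous copies (chunk size and stream count are illustrative assumptions) follows.

```python
# Illustrative pipelining: issue asynchronous host-to-device copies on
# alternating streams so each transfer can overlap with work queued on the
# other stream, instead of serializing copy -> compute -> copy.
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

chunk = 32 * 1024 * 1024                      # 32 MiB per chunk
nchunks = 8
host_chunks = [cuda.pagelocked_empty(chunk, dtype=np.uint8)
               for _ in range(nchunks)]       # pinned buffers for async DMA
dev = cuda.mem_alloc(chunk * nchunks)
streams = [cuda.Stream(), cuda.Stream()]

for i in range(nchunks):
    s = streams[i % 2]
    cuda.memcpy_htod_async(int(dev) + i * chunk, host_chunks[i], stream=s)
    # ...kernels operating on chunk i would be launched on stream s here...

for s in streams:
    s.synchronize()
print("all chunks transferred")
```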

SLICE & DICE - MORE THAN ONE WAY TO GET 4 GPUs
- Native GPU performance with network-attached GPUs: multiple ways to compose a virtual 4-GPU node from R730 clients and C4130 GPU servers, all with native efficiency
- Run-time comparison, lower is better: seconds to train Caffe GoogleNet (batch size 128) and TensorFlow Pixel-CNN

TRAINING PERFORMANCE
- Continued strong scaling: Caffe GoogleNet weak scaling past the PCIe host-bridge limit, from 1, 2, and 4 native GPUs out to 8 and 16 remote GPUs (R730 clients, C4130 GPU servers)
- Accelerate hyperparameter optimization: TensorFlow 1.0 with Pixel-CNN, running many trials in parallel (see the sketch below)
[Chart: per-configuration scaling figures of 86%, 74%, 73%, 55%, and 53%]
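The hyperparameter-optimization speedup comes from running many independent training trials at once, each pinned to its own (possibly network-attached) GPU. A generic sketch of that launch pattern, assuming a hypothetical train.py script that accepts a learning-rate flag:

```python
# Launch one training job per GPU in parallel, each exploring a different
# learning rate. CUDA_VISIBLE_DEVICES pins each job to a single device;
# with network-attached GPUs the same pattern scales past one server.
import os
import subprocess

learning_rates = [0.1, 0.03, 0.01, 0.003]   # one trial per GPU
procs = []
for gpu_id, lr in enumerate(learning_rates):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(
        ["python", "train.py", "--lr", str(lr)],   # hypothetical script
        env=env))

for p in procs:
    p.wait()
```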

Other PCIe GPU Configurations
- Available, currently testing, and further configurations, including Config 'G' with P40 GPUs
- Links: .../high-performance-computing/b/general, .../p40-gpus

NvLink Configuration - Config 'K'
- 4x P100-16GB SXM2 GPUs
- 2 CPUs
- PCIe switch
- 1 PCIe slot - EDR IB

NvLink Configuration - Config 'L'
- 4x P100-16GB SXM2 GPUs
- 2 CPUs
- PCIe switch
- 1 PCIe slot - EDR IB
- Memory: 256GB with 16GB DIMMs @ 2133
- OS: Ubuntu 16.04
- CUDA: 8.1

Software Solutions

Overview – Bright ML
Dell EMC has partnered with Bright Computing to offer their Bright ML package as the software stack on the Dell EMC deep learning hardware solution.

Bright ML Overview

Machine Learning in Seismic Imaging Using KNL + FPGA – Project #1
Bhavesh Patel – Server Advanced Engineering
Robert Dildy – Product Technologist Sr. Consultant, Engineering Solutions

Abstract
This paper is focused on how to apply machine learning to seismic imaging with the use of an FPGA as a co-accelerator. It will cover two hardware technologies, 1) Intel KNL Phi and 2) FPGA, and also address how to use machine learning for seismic imaging. There are different types of accelerators, such as GPUs and Intel Phi, but we are choosing to study how we can use the i-ABRA platform on KNL + FPGA to train a neural network using seismic imaging data and then perform inference. Machine learning in a broader sense can be divided into two parts: training and inference.

Background
Seismic imaging is a standard data processing technique used to create an image of subsurface structures of the Earth from measurements recorded at the surface via seismic wave propagation captured from various sound energy sources.
There are certain challenges with seismic data interpretation, such as 3D starting to replace 2D for seismic interpretation.
There has been rapid growth in the use of computer vision technology, with several companies developing image recognition platforms. This technology is being used for automatic photo tagging and classification. The same concept could be applied to identify geometric patterns in the data and generate image captions/descriptions. We can use Convolutional Neural Networks (CNNs) to learn visual concepts from massive amounts of data, which would help in doing objective analysis of it (see the sketch below).
The use of machine learning and image processing algorithms to analyze, recognize, and understand visual content would allow us to analyze data with both supervised neural networks (SNN) and unsupervised neural networks (UNN) like CNNs.
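As a purely illustrative sketch of applying a CNN to seismic image patches (the patch shape, number of facies classes, and layer widths are placeholder assumptions, not values from this project), a Keras model could look like:

```python
# Minimal CNN for classifying seismic image patches into facies classes.
# Input shape, number of classes, and layer widths are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6            # hypothetical number of facies classes
PATCH_SHAPE = (64, 64, 1)  # hypothetical grayscale seismic patches

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=PATCH_SHAPE),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(train_patches, train_labels, epochs=10, validation_split=0.1)
```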

Observing both plane and cross-section
[Figures: Seismic Stratigraphic image learning; Seismic Geomorphology image learning]

Models in plane and cross-section

Tags with 'facies' recognition

Solution
For this paper we will be using the following hardware and software platforms:
Hardware platform:
- C6320P sleds with Intel KNL Phi
- Intel Arria 10 (A10PL4) FPGA adapter
Software platform:
- i-ABRA deep learning framework
This will be a joint collaboration with:
- Dell EMC
- Intel
- i-ABRA
- Seismic imaging firm - TBD

Thank You
