Deep Learning Cookbook: technology recipes to run deep learning workloads
Natalia Vassilieva, Sergey Serebryakov
Deep learning applications
- Vision: search and information extraction, security/video surveillance, self-driving cars, medical imaging, robotics
- Speech: interactive voice response (IVR) systems, voice interfaces (mobile, cars, gaming, home), security (speaker identification), health care, simultaneous interpretation
- Text: search and ranking, sentiment analysis, machine translation, question answering
- Other: recommendation engines, advertising, fraud detection, AI challenges, drug discovery, sensor data analysis, diagnostic support
Deep learning ecosystem
[Figure: software (including Keras) and hardware components]
How to pick the right hardware/software stack?
Does one size fit all?
Applications break down
By data type, with an example application:
- Images: tissue classification in medical images
- Video: video surveillance
- Speech: speech recognition
- Text: sentiment analysis
- Sensor: predictive maintenance
- Other: fraud detection
By task:
- Detection: look for a known object/pattern
- Generation: generate content
- Classification: assign a label from a predefined set of labels
- Anomaly detection: look for abnormal, unknown patterns
Types of artificial neural networks
Topology to fit data characteristics
- Images: Convolutional (CNN)
- Speech, time series, sequences: Fully Connected (FC), Recurrent (RNN)
[Figure: two network diagrams, each with Input, Hidden Layer 1, Hidden Layer 2, Hidden Layer 3, Output]
One size does NOT fit all
- Application
- Data type
- Model (topology of artificial neural network):
  - How many layers
  - How many neurons per layer
  - Connections between neurons (types of layers)
- Data size
Popular models

Name                Type  Model size (# params)  Model size (MB)  GFLOPs (forward pass)
AlexNet             CNN   60,965,224             233 MB           0.7
GoogleNet           CNN   6,998,552              27 MB            1.6
VGG-16              CNN   138,357,544            528 MB           15.5
VGG-19              CNN   143,667,240            548 MB           19.6
ResNet50            CNN   25,610,269             98 MB            3.9
ResNet101           CNN   44,654,608             170 MB           7.6
ResNet152           CNN   60,344,387             230 MB           11.3
Eng Acoustic Model  RNN   34,678,784             132 MB           0.035
TextCNN             CNN   151,690                0.6 MB           0.009
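The MB column of the popular-models table appears consistent with storing each parameter as a 4-byte (FP32) value and counting 1 MB as 1024 × 1024 bytes. Both of those are assumptions, since the slides do not state them. A minimal sketch that recomputes the sizes from the parameter counts (the function and constant names are hypothetical):

```python
# Assumption (not stated in the slides): FP32 weights, 4 bytes per parameter,
# and 1 MB counted as 1024 * 1024 bytes.
BYTES_PER_PARAM_FP32 = 4
BYTES_PER_MB = 1024 * 1024

# Parameter counts copied from the "Popular models" slide.
PARAM_COUNTS = {
    "AlexNet": 60_965_224,
    "GoogleNet": 6_998_552,
    "VGG-16": 138_357_544,
    "VGG-19": 143_667_240,
    "ResNet50": 25_610_269,
    "ResNet101": 44_654_608,
    "ResNet152": 60_344_387,
    "Eng Acoustic Model": 34_678_784,
    "TextCNN": 151_690,
}


def model_size_mb(num_params, bytes_per_param=BYTES_PER_PARAM_FP32):
    """Return the size of the weights in MB for a given parameter count."""
    return num_params * bytes_per_param / BYTES_PER_MB


if __name__ == "__main__":
    for name, count in PARAM_COUNTS.items():
        print(f"{name:20s} {count:>12,d} params  {model_size_mb(count):8.1f} MB")
```

Halving `bytes_per_param` to 2 gives a rough idea of the footprint with FP16 weights. Note that this counts weights only, not activations or optimizer state.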
Compute requirements
Example model: ResNet152 (CNN), 60,344,387 parameters, 230 MB, 11.3 GFLOPs per forward pass
- Training data: 14M images (ImageNet)
- FLOPs per epoch: $3 \times 11.3 \times 10^{9} \times 14 \times 10^{6} \approx 5 \times 10^{17}$
- 1 epoch per hour: ~140 TFLOPS
Today's hardware:
- Google TPU2: 180 TFLOPS Tensor ops
- NVIDIA Tesla V100: 15 TFLOPS SP (30 TFLOPS FP16, 120 TFLOPS Tensor ops), 12 GB memory
- NVIDIA Tesla P100: 10.6 TFLOPS SP, 16 GB memory
- NVIDIA Tesla K40: 4.29 TFLOPS SP, 12 GB memory
- NVIDIA Tesla K80: 5.6 TFLOPS SP (8.74 TFLOPS SP with GPU boost), 24 GB memory
- Intel Xeon Phi: 2.4 TFLOPS SP
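The ResNet152 estimate can be reproduced with a few lines of arithmetic. This is a sketch under one stated assumption: the factor of 3 is read here as "forward plus backward pass costs about three forward passes", a common rule of thumb that the slide does not spell out. The function names are hypothetical.

```python
# Figures taken from the slide.
FORWARD_GFLOPS = 11.3          # ResNet152, one forward pass for one image
IMAGES_PER_EPOCH = 14_000_000  # ImageNet, as quoted on the slide

# Assumption: a training step (forward + backward) costs ~3x a forward pass.
TRAINING_FACTOR = 3


def flops_per_epoch(forward_gflops, images, training_factor=TRAINING_FACTOR):
    """Total floating-point operations needed for one training epoch."""
    return training_factor * forward_gflops * 1e9 * images


def required_tflops(total_flops, seconds):
    """Sustained TFLOPS needed to finish total_flops within the given time."""
    return total_flops / seconds / 1e12


if __name__ == "__main__":
    total = flops_per_epoch(FORWARD_GFLOPS, IMAGES_PER_EPOCH)
    print(f"FLOPs per epoch: {total:.3e}")
    print(f"TFLOPS for 1 epoch/hour: {required_tflops(total, 3600):.1f}")
```

Without rounding, the total is about 4.7 × 10^17 FLOPs, or roughly 132 sustained TFLOPS for one epoch per hour. The slide's ~140 TFLOPS follows from rounding the total up to 5 × 10^17 first. These are sustained rates, so peak hardware figures need to be discounted by the achieved utilization.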
Model parallelism
- Can be achieved with scalable distributed matrix operations
- Requires a certain compute/bandwidth ratio
Let's assume:
- input size, batch size, output size
- $\gamma$: compute power of the device (FLOPS)
- $\beta$: bandwidth (memory or interconnect)
- $p$: number of compute devices
[Equations: compute time versus communication time in terms of the quantities above, for FP32; expressions not recoverable from the transcript]
Reference: "SUMMA: Scalable Universal Matrix Multiplication Algorithm", R. A. van de Geijn, J. Watts
Model parallelism: example for FP32
- Size: 2000; $\gamma$: 15 TFLOPS
- Two cases shown, with values 10 and 1 (symbols not recoverable from the transcript)
- $\beta$: 300 GB/s; $\beta$: 30 GB/s
Data parallelism
$T_{compute}(p, c, \gamma) = c / (p\gamma)$
$T_{communicate}(p, w, \beta)$: [expression not recoverable from the transcript]
- $p$: number of workers (nodes)
- $\gamma$: the computational power of the node
- $c$: the computational complexity of the model
- $\beta$: bandwidth
- $w$: the size of the weights in bits
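To make the compute-versus-communication trade-off concrete, here is a minimal sketch. The compute term follows the slide (complexity divided by the number of workers times per-node compute power). The communication term on the slide did not survive extraction, so the sketch swaps in the bandwidth term of a ring all-reduce, 2(p-1)/p · w/β, which is an assumption and not the authors' model. The batch size of 64 is also a hypothetical value.

```python
def t_compute(p, c, gamma):
    """Compute time per batch: c FLOPs split across p workers of gamma FLOPS each."""
    return c / (p * gamma)


def t_communicate_ring(p, w_bits, beta_bits_per_s):
    """ASSUMPTION, not from the slides: bandwidth cost of a ring all-reduce.

    Each worker sends and receives 2 * (p - 1) / p of the weights.
    """
    if p == 1:
        return 0.0  # a single worker has nothing to exchange
    return 2.0 * (p - 1) / p * w_bits / beta_bits_per_s


if __name__ == "__main__":
    # ResNet152 figures from the slides; batch size 64 is a hypothetical choice.
    batch = 64
    c = 3 * 11.3e9 * batch          # FLOPs per training batch (3x forward pass)
    w = 60_344_387 * 32             # weights in bits (FP32)
    gamma = 4e12                    # NVIDIA K40, ~4 TFLOPS
    links = {"PCIe v3": 16e9 * 8, "Infiniband": 56e9}  # bits per second

    for name, beta in links.items():
        for p in (1, 2, 4, 8):
            print(f"{name:10s} p={p}: compute {t_compute(p, c, gamma) * 1e3:7.1f} ms, "
                  f"communicate {t_communicate_ring(p, w, beta) * 1e3:6.1f} ms")
```

Under this assumed model the compute time shrinks with p while the communication time approaches 2w/β, so adding workers stops helping once the two become comparable.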
Data parallelism: example configuration
NVIDIA K40 (~4 TFLOPS), PCIe v3 (~16 GB/s)
Data parallelism: example configuration
NVIDIA K40 (~4 TFLOPS), Infiniband (~56 Gb/s)
Deep Learning Cookbook helps to pick the right HW/SW stack
- Benchmarking suite
  - Benchmarking scripts
  - Set of benchmarks (for core operations and reference models)
- Performance measurements for a subset of applications, models and HW/SW stacks
  - 11 models
  - 8 frameworks
  - 6 hardware systems
- Analytical performance and scalability models
  - Performance prediction for arbitrary models
  - Scalability prediction
- Reference solutions, white papers
Selected scalability results
HPE Apollo 6500 (8 x NVIDIA P100)
[Figure: weak-scaling plots of batch time (ms) versus number of GPUs, one curve per batch size, for AlexNet, DeepMNIST, EngAcousticModel, GoogleNet, VGG16 and VGG19]
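Weak-scaling results of this kind are usually read as follows: the per-GPU batch size stays fixed, so ideal scaling means the batch time stays flat as GPUs are added. The sketch below shows one way to turn batch times into efficiency and throughput. The helper names are hypothetical, and the timings in the example are made-up placeholders, not measurements from the slides.

```python
def weak_scaling_efficiency(batch_time_1gpu_ms, batch_time_ngpu_ms):
    """Ratio of single-GPU batch time to N-GPU batch time (1.0 is ideal)."""
    return batch_time_1gpu_ms / batch_time_ngpu_ms


def throughput_samples_per_s(num_gpus, per_gpu_batch, batch_time_ms):
    """Samples processed per second across all GPUs."""
    return num_gpus * per_gpu_batch / (batch_time_ms / 1000.0)


if __name__ == "__main__":
    # Placeholder timings for illustration only; NOT taken from the slides.
    t1_ms, t8_ms, per_gpu_batch = 100.0, 125.0, 64
    print("efficiency at 8 GPUs:", weak_scaling_efficiency(t1_ms, t8_ms))
    print("throughput at 8 GPUs:", throughput_samples_per_s(8, per_gpu_batch, t8_ms))
```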
Selected observations and tips
- Larger models are easier to scale (such as ResNet and VGG)
- A single GPU can hold only small batches (the rest of memory is occupied by a model)
- Fast interconnect is more important for less compute-intensive models (FC)
- A rule of thumb: 1 or 2 CPU cores per GPU
- PCIe topology of the system is important
Further into the future: neuromorphic research projects
Neuromorphic Computing: the integration of algorithms, architectures, and technologies, informed by neuroscience, to create new computational approaches.
- Memristor Dot-Product Engine (DPE): successfully demonstrated
  - Memristor crossbar analog vector-matrix multiplication accelerator
  - $I_j^{O} = \sum_i G_{ij} \cdot V_i^{I}$
- Hopfield Network (electronic and photonic): in progress
Hewlett Packard Enterprise
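The crossbar relation, where each output current is the sum over rows of conductance times input voltage, is an ordinary vector-matrix product. A minimal digital sketch of what the analog crossbar computes is shown below. It is an idealized illustration only (no device noise, wire resistance, or quantization), and the function name is hypothetical.

```python
def crossbar_vmm(conductances, voltages):
    """Idealized crossbar: output current on column j is sum_i G[i][j] * V[i].

    conductances: list of rows, conductances[i][j] links input row i to output column j.
    voltages: input voltage applied to each row.
    """
    rows = len(conductances)
    if len(voltages) != rows:
        raise ValueError("need one input voltage per crossbar row")
    cols = len(conductances[0])
    return [
        sum(conductances[i][j] * voltages[i] for i in range(rows))
        for j in range(cols)
    ]


if __name__ == "__main__":
    g = [[1.0, 2.0],
         [3.0, 4.0]]       # illustrative conductance values
    v = [1.0, 1.0]         # illustrative input voltages
    print(crossbar_vmm(g, v))
```

In the analog device all column currents appear at once, which is the appeal of the approach; the nested loop here only mirrors the arithmetic.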
Thank you
Natalia Vassilieva
nvassilieva@hpe.com